Abstract


Positive definite kernels on probability measures have been recently applied to classification
problems involving text, images, and other types of structured data. Some of these kernels are
related to classic information theoretic quantities, such as (Shannon’s) mutual information and the
Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive
generalizations of Shannon’s information theory. This paper bridges these two trends by introducing
nonextensive information theoretic kernels on probability measures, based on new JS-type
divergences. These new divergences result from extending the two building blocks of the classical
JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider
concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we
introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and
define a k-th order JT q-difference between stochastic processes. We then define a new family of
nonextensive mutual information kernels, which allow weights to be assigned to their arguments,
and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string
kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of
these kernels on text categorization tasks, in which documents are modeled both as bags of words
and as sequences of characters.


Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy,
Jensen-Shannon divergence, string kernels

1. Introduction


In kernel-based machine learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004),
there has been recent interest in defining kernels on probability distributions to tackle several prob- 
lems involving structured data (Desobry et al., 2007; Moreno et al., 2004; Jebara et al., 2004; Hein 
and Bousquet, 2005; Lafferty and Lebanon, 2005; Cuturi et al., 2005). By defining a parametric 
family S containing the distributions from which the data points (in the input space X) are assumed 
to have been generated, and defining a map from X to S (e.g., via maximum likelihood
estimation), a distribution in S may be fitted to each datum. Therefore, a kernel that is defined on S × S
automatically induces a kernel on X × X, through map composition. In text categorization, this
framework appears as an alternative to the Euclidean geometry inherent to the usual bag-of-words 
representations. In fact, approaches that map data to statistical manifolds, equipped with well- 
motivated non-Euclidean metrics (Lafferty and Lebanon, 2005), often outperform support vector 
machine (SVM) classifiers with linear kernels (Joachims, 2002). Some of these kernels have a 
natural information theoretic interpretation, establishing a bridge between kernel methods and in- 
formation theory (Cuturi et al., 2005; Hein and Bousquet, 2005). 

The main goal of this paper is to widen that bridge; we do that by introducing a new class of ker- 
nels rooted in nonextensive information theory, which contains previous information theoretic ker- 
nels as particular elements. The Shannon and Rényi entropies (Shannon, 1948; Rényi, 1961) share 
the extensivity property: the joint entropy of a pair of independent random variables equals the sum 
of the individual entropies. Abandoning this property yields the so-called nonextensive entropies 
(Havrda and Charvát, 1967; Lindhard, 1974; Lindhard and Nielsen, 1971; Tsallis, 1988), which
have raised great interest among physicists in modeling phenomena such as long-range interactions 
and multifractals, and in constructing nonextensive generalizations of Boltzmann-Gibbs statisti- 
cal mechanics (Abe, 2006). Nonextensive entropies have also been recently used in signal/image 
processing (Li et al., 2006) and other areas (Gell-Mann and Tsallis, 2004). The so-called Tsallis
entropies (Havrda and Charvát, 1967; Tsallis, 1988) form a parametric family of nonextensive
entropies that includes the Shannon-Boltzmann-Gibbs entropy as a particular case. Nonextensive 
generalizations of information theory have been proposed (Furuichi, 2006). 

Convexity and Jensen’s inequality are key concepts underlying several central results of infor- 
mation theory, for example, the non-negativity of the Kullback-Leibler (KL) divergence (or rela- 
tive entropy) (Kullback and Leibler, 1951). Jensen’s inequality (Jensen, 1906) also underlies the 
Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of the KL divergence (Lin 
and Wong, 1990; Lin, 1991), often used in statistics, machine learning, signal/image processing, 
and physics. 

In this paper, we introduce new extensions of JS-type divergences by generalizing its two pil- 
lars: convexity and Shannon’s entropy. These divergences are then used to define new information- 
theoretic kernels between probability distributions. More specifically, our main contributions are: 


• The concept of q-convexity, generalizing that of convexity, for which we prove a Jensen
q-inequality. The related concept of Jensen q-differences, which generalize Jensen differences,
is also proposed. Based on these concepts, we introduce the Jensen-Tsallis (JT) q-difference,
a nonextensive generalization of the JS divergence, which is also a “mutual information” in
the sense of Furuichi (2006).


• Characterization of the JT q-difference, with respect to convexity and extrema, extending
work by Burbea and Rao (1982) and by Lin (1991) for the JS divergence.




NONEXTENSIVE INFORMATION THEORETIC KERNELS ON MEASURES 


• Definition of k-th order joint and conditional JT q-differences for families of stochastic
processes, and derivation of a chain rule.


• A broad family of (nonextensive information theoretic) positive definite kernels, interpretable
as nonextensive mutual information kernels, ranging from the Boolean to the linear kernels,
and including the JS kernel proposed by Hein and Bousquet (2005).


• A family of (nonextensive information theoretic) positive definite kernels between stochastic
processes, subsuming well-known string kernels (e.g., the p-spectrum kernel) (Leslie et al.,
2002).


• Extensions of results of Hein and Bousquet (2005) proving positive definiteness of kernels
based on the unbalanced JS divergence. A connection between these new kernels and those
studied by Fuglede (2005) and Hein and Bousquet (2005) is also established. In passing, we
show that the parametrix approximation of the multinomial diffusion kernel introduced by
Lafferty and Lebanon (2005) is not positive definite in general.


The paper is organized as follows. Section 2 reviews nonextensive entropies, with empha- 
sis on the Tsallis case. Section 3 discusses Jensen differences and divergences. The concepts 
of q-differences and q-convexity are introduced in Section 4, where they are used to define and
characterize some new divergence-type quantities. In Section 5, we define the Jensen-Tsallis q- 
difference and derive some of its properties; in that section, we also define k-th order Jensen-Tsallis 
q-differences for families of stochastic processes. The new family of entropic kernels is introduced 
and characterized in Section 6, which also introduces nonextensive kernels between stochastic pro- 
cesses. Experiments on text categorization are reported in Section 7. Section 8 concludes the paper 
and discusses future research. 


2. Nonextensive Entropies and Tsallis Statistics 


In this section, we start with a brief overview of nonextensive entropies. We then introduce the 
family of Tsallis entropies, and extend their domain to unnormalized measures. 


2.1 Nonextensivity 


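In what follows, $\mathbb{R}_+$ denotes the nonnegative reals, $\mathbb{R}_{++}$ denotes the strictly positive reals, and

$$\Delta^{n-1} \triangleq \Big\{ (x_1,\ldots,x_n) \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^n x_i = 1,\ \forall i\ x_i \ge 0 \Big\}$$

denotes the (n − 1)-dimensional simplex.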

Inspired by the axiomatic formulation of Shannon’s entropy (Khinchin, 1957; Shannon and
Weaver, 1949), Suyari (2004) proposed an axiomatic framework for nonextensive entropies and
a uniqueness theorem. Let $q > 0$ be a fixed scalar, called the entropic index. Suyari’s axioms
(Appendix A) determine a function $S_{q,\phi} : \Delta^{n-1} \to \mathbb{R}$ of the form

$$S_{q,\phi}(p_1,\ldots,p_n) = \begin{cases} \dfrac{k}{\phi(q)} \Big(1 - \sum_{i=1}^n p_i^q\Big) & \text{if } q \neq 1, \\[2ex] -k \sum_{i=1}^n p_i \ln p_i & \text{if } q = 1, \end{cases} \tag{1}$$

where $k$ is a positive constant, and $\phi : \mathbb{R}_+ \to \mathbb{R}$ is a continuous function that satisfies the following
three conditions: (i) $\phi(q)$ has the same sign as $q - 1$; (ii) $\phi(q)$ vanishes if and only if $q = 1$; (iii) $\phi$ is
differentiable in a neighborhood of 1 and $\phi'(1) = 1$.

Note that $S_{1,\phi} = \lim_{q \to 1} S_{q,\phi}$, thus $S_{q,\phi}(p_1,\ldots,p_n)$, seen as a function of $q$, is continuous at
$q = 1$. For any $\phi$ satisfying these conditions, $S_{q,\phi}$ has the pseudoadditivity property: for any two
independent random variables $A$ and $B$, with probability mass functions $p_A \in \Delta^{n_A-1}$ and $p_B \in \Delta^{n_B-1}$,
respectively, consider the new random variable $A \otimes B$ defined by the joint distribution
$p_A \otimes p_B \in \Delta^{n_A n_B - 1}$; then,

$$S_{q,\phi}(A \otimes B) = S_{q,\phi}(A) + S_{q,\phi}(B) - \frac{\phi(q)}{k}\, S_{q,\phi}(A)\, S_{q,\phi}(B),$$

where we denote (as usual) $S_{q,\phi}(A) \triangleq S_{q,\phi}(p_A)$.
For $q = 1$, Suyari’s axioms recover the Shannon-Boltzmann-Gibbs (SBG) entropy,

$$S_{1,\phi}(p_1,\ldots,p_n) = H(p_1,\ldots,p_n) = -k \sum_{i=1}^n p_i \ln p_i,$$

and pseudoadditivity turns into additivity, that is, $H(A \otimes B) = H(A) + H(B)$ holds.
Several proposals for $\phi$ have appeared in the literature (Havrda and Charvát, 1967; Daróczy,
1970; Tsallis, 1988). In this article, unless stated otherwise, we set $\phi(q) = q - 1$, which yields the
Tsallis entropy:

$$S_q(p_1,\ldots,p_n) = \frac{k}{q-1} \Big(1 - \sum_{i=1}^n p_i^q\Big). \tag{2}$$

To simplify, we let $k = 1$ and write the Tsallis entropy as

$$S_q(X) \triangleq S_q(p_1,\ldots,p_n) = -\sum_{x \in \mathcal{X}} p(x)^q \ln_q p(x), \tag{3}$$

where $\ln_q(x) \triangleq (x^{1-q} - 1)/(1 - q)$ is the q-logarithm function, which satisfies
$\ln_q(xy) = \ln_q(x) + x^{1-q} \ln_q(y)$ and $\ln_q(1/x) = -x^{q-1} \ln_q(x)$. This notation was introduced by Tsallis (1988).
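As a numerical illustration of Equations (2) and (3), the following minimal sketch evaluates the q-logarithm and the Tsallis entropy (with k = 1) on a finite sample space and checks pseudoadditivity for φ(q) = q − 1. The function names and the example distributions are illustrative choices, not part of the original text.

```python
import math

def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q); reduces to the natural log at q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy with k = 1, written as in Equation (3).
    Zero-probability outcomes are skipped."""
    return -sum(px ** q * ln_q(px, q) for px in p if px > 0.0)

def product(pa, pb):
    """Joint distribution of two independent variables, flattened to a list."""
    return [a * b for a in pa for b in pb]

# Illustrative inputs (hypothetical, chosen only for the check).
p_a = [0.2, 0.8]
p_b = [0.5, 0.3, 0.2]
q = 1.7

s_a = tsallis_entropy(p_a, q)
s_b = tsallis_entropy(p_b, q)
s_ab = tsallis_entropy(product(p_a, p_b), q)

# Pseudoadditivity with phi(q) = q - 1 and k = 1.
pseudo = s_a + s_b - (q - 1.0) * s_a * s_b
print(s_ab, pseudo)
```

Both printed numbers should agree up to floating-point rounding; setting q = 1.0 makes the correction term vanish and recovers ordinary additivity.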


2.2 Tsallis Entropies 


Furuichi (2006) derived some information theoretic properties of Tsallis entropies. Tsallis joint and
conditional entropies are defined, respectively, as

$$S_q(X,Y) \triangleq -\sum_{x,y} p(x,y)^q \ln_q p(x,y)$$

and

$$S_q(X|Y) \triangleq -\sum_{x,y} p(x,y)^q \ln_q p(x|y) = \sum_y p(y)^q\, S_q(X|y), \tag{4}$$

and the chain rule $S_q(X,Y) = S_q(X) + S_q(Y|X)$ holds.
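The chain rule can be checked numerically on a small joint distribution. The sketch below is only an illustration under the k = 1 convention; the joint table `pxy` (rows indexed by x, columns by y) and the helper names are hypothetical.

```python
import math

def ln_q(x, q):
    """q-logarithm; natural log at q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def entropy(p, q):
    """Tsallis entropy (k = 1) of a finite distribution."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0.0)

def conditional_entropy_y_given_x(pxy, q):
    """S_q(Y|X) = -sum_{x,y} p(x,y)^q ln_q p(y|x)."""
    total = 0.0
    for row in pxy:
        px = sum(row)
        for v in row:
            if v > 0.0:
                total -= v ** q * ln_q(v / px, q)
    return total

# Hypothetical joint distribution of (X, Y).
pxy = [[0.4, 0.1],
       [0.2, 0.3]]
q = 0.6

px = [sum(row) for row in pxy]
joint = entropy([v for row in pxy for v in row], q)
chain = entropy(px, q) + conditional_entropy_y_given_x(pxy, q)

# Equivalent form of the conditional entropy: sum_x p(x)^q S_q(Y|x).
weighted = sum(sum(row) ** q * entropy([v / sum(row) for v in row], q) for row in pxy)
print(joint, chain, weighted)
```

The first two printed values coincide, and the third equals the conditional entropy term, mirroring the two expressions in Equation (4).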
For two probability mass functions $p_X, p_Y \in \Delta^{n-1}$, the Tsallis relative entropy, generalizing the
KL divergence, is defined as

$$D_q(p_X \| p_Y) \triangleq -\sum_x p_X(x) \ln_q \frac{p_Y(x)}{p_X(x)}. \tag{5}$$

Finally, the Tsallis mutual entropy is defined as

$$I_q(X;Y) \triangleq S_q(X) - S_q(X|Y) = S_q(Y) - S_q(Y|X), \tag{6}$$

generalizing (for q > 1) Shannon’s mutual information (Furuichi, 2006). In Section 5, we establish
a relationship between Tsallis mutual entropy and a quantity called Jensen-Tsallis q-difference,
generalizing the one between mutual information and the JS divergence (shown, e.g., by Grosse
et al. 2002, and recalled below, in Section 3.2).

Furuichi (2006) also mentions an alternative generalization of Shannon’s mutual information,
defined as

$$\tilde{I}_q(X;Y) \triangleq D_q(p_{X,Y} \| p_X \otimes p_Y), \tag{7}$$

where $p_{X,Y}$ is the true joint probability mass function of (X,Y) and $p_X \otimes p_Y$ denotes their joint
probability if they were independent. This alternative definition of a “Tsallis mutual entropy” has
also been used by Lamberti and Majtey (2003); notice that $I_q(X;Y) \neq \tilde{I}_q(X;Y)$ in general, the case
q = 1 being a notable exception. In Section 5, we show that this alternative definition also leads to
a nonextensive analogue of the JS divergence.
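To see concretely that the two generalizations of mutual information differ, the following sketch evaluates both on one small joint distribution. It assumes k = 1 and uses the chain rule to write the mutual entropy (6) as S_q(X) + S_q(Y) − S_q(X,Y); the table `pxy` and all function names are illustrative.

```python
import math

def ln_q(x, q):
    """q-logarithm; natural log at q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def entropy(p, q):
    """Tsallis entropy (k = 1)."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0.0)

def relative_entropy(p, r, q):
    """Tsallis relative entropy D_q(p||r) = -sum_x p(x) ln_q(r(x)/p(x))."""
    return -sum(pv * ln_q(rv / pv, q) for pv, rv in zip(p, r) if pv > 0.0)

def marginals(pxy):
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return px, py

def mutual_entropy(pxy, q):
    """Equation (6), using the chain rule: S_q(X) + S_q(Y) - S_q(X,Y)."""
    px, py = marginals(pxy)
    flat = [v for row in pxy for v in row]
    return entropy(px, q) + entropy(py, q) - entropy(flat, q)

def alt_mutual_entropy(pxy, q):
    """Equation (7): relative entropy between the joint and the product of marginals."""
    px, py = marginals(pxy)
    flat = [v for row in pxy for v in row]
    indep = [a * b for a in px for b in py]
    return relative_entropy(flat, indep, q)

pxy = [[0.4, 0.1],
       [0.2, 0.3]]  # hypothetical joint distribution

print(mutual_entropy(pxy, 2.0), alt_mutual_entropy(pxy, 2.0))
print(mutual_entropy(pxy, 1.0), alt_mutual_entropy(pxy, 1.0))
```

For this table the two quantities differ at q = 2 and coincide at q = 1, where both reduce to Shannon's mutual information.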


2.3 Entropies of Measures and Denormalization Formulae 


Throughout this paper, we consider functionals that extend the domain of the Shannon-Boltzmann- 
Gibbs and Tsallis entropies to include unnormalized measures. Although, as shown below, these 
functionals are completely characterized by their restriction to the normalized probability distri- 
butions, the denormalization expressions will play an important role in Section 6 to derive novel 
positive definite kernels inspired by mutual informations. 

In order to keep generality, whenever possible we do not restrict to finite or countable sample
spaces. Instead, we consider a measure space $(\mathcal{X}, \mathcal{M}, \nu)$ where $\mathcal{X}$ is Hausdorff and $\nu$ is a σ-finite
Radon measure. We denote by $M_+(\mathcal{X})$ the set of finite Radon ν-absolutely continuous measures
on $\mathcal{X}$, and by $M_+^1(\mathcal{X})$ the subset of those which are probability measures. For simplicity, we often
identify each measure in $M_+(\mathcal{X})$ or $M_+^1(\mathcal{X})$ with its corresponding nonnegative density; this is
legitimated by the Radon-Nikodym theorem, which guarantees the existence and uniqueness (up
to equivalence within measure zero) of a density function $f : \mathcal{X} \to \mathbb{R}_+$. In the sequel,
Lebesgue-Stieltjes integrals of the form $\int_A f(x)\, d\nu(x)$ are often written as $\int_A f$, or simply $\int f$, if $A = \mathcal{X}$.
Unless otherwise stated, $\nu$ is the Lebesgue-Borel measure, if $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathrm{int}\,\mathcal{X} \neq \emptyset$, or the counting
measure, if $\mathcal{X}$ is countable. In the latter case integrals can be seen as finite sums or infinite series.

Define $\bar{\mathbb{R}} \triangleq \mathbb{R} \cup \{-\infty, +\infty\}$. For some functional $G : M_+(\mathcal{X}) \to \bar{\mathbb{R}}$, let the set
$M_+^G(\mathcal{X}) \triangleq \{ f \in M_+(\mathcal{X}) : |G(f)| < \infty \}$ be its effective domain, and
$M_+^{1,G}(\mathcal{X}) \triangleq M_+^G(\mathcal{X}) \cap M_+^1(\mathcal{X})$ be its subdomain of probability measures.

The following functional (Cuturi and Vert, 2005) extends the Shannon-Boltzmann-Gibbs entropy
from $M_+^{1,H}(\mathcal{X})$ to the unnormalized measures in $M_+^H(\mathcal{X})$:

$$H(f) = -k \int f \ln f = \int \varphi_H \circ f, \tag{8}$$

where $k > 0$ is a constant, the function $\varphi_H : \mathbb{R}_+ \to \mathbb{R}$ is defined as

$$\varphi_H(y) = -k\, y \ln y,$$

and, as usual, $0 \ln 0 \triangleq 0$.

The generalized form of the KL divergence, often called generalized I-divergence (Csiszar,
1975), is a directed divergence between two measures $\mu_f, \mu_g \in M_+^H(\mathcal{X})$, such that $\mu_f$ is $\mu_g$-absolutely
continuous (denoted $\mu_f \ll \mu_g$). Let $f$ and $g$ be the densities associated with $\mu_f$ and $\mu_g$, respectively.
In terms of densities, this generalized KL divergence is

$$D(f,g) = k \int \Big( g - f + f \ln \frac{f}{g} \Big). \tag{9}$$


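Let us now proceed similarly with the nonextensive entropies. For $q > 0$, let
$M_+^{S_q}(\mathcal{X}) = \{ f \in M_+(\mathcal{X}) : f^q \in M_+(\mathcal{X}) \}$ for $q \neq 1$, and $M_+^{S_q}(\mathcal{X}) = M_+^H(\mathcal{X})$ for $q = 1$. The nonextensive counterpart
of (8), defined on $M_+^{S_q}(\mathcal{X})$, is

$$S_{q,\phi}(f) = \int \varphi_q \circ f, \tag{10}$$

where $\varphi_q : \mathbb{R}_+ \to \mathbb{R}$ is given by

$$\varphi_q(y) = \begin{cases} \varphi_H(y) & \text{if } q = 1, \\[1ex] \dfrac{k}{\phi(q)} \big( y - y^q \big) & \text{if } q \neq 1, \end{cases} \tag{11}$$

and $\phi : \mathbb{R}_+ \to \mathbb{R}$ satisfies conditions (i)-(iii) stated following Equation (1). The Tsallis entropy is
obtained for $\phi(q) = q - 1$,

$$S_q(f) = -k \int f^q \ln_q f. \tag{12}$$

Similarly, a nonextensive generalization of the generalized KL divergence (9) is

$$D_{q,\phi}(f,g) = -\frac{k}{\phi(q)} \int \big( q f + (1-q)\, g - f^q g^{1-q} \big),$$

for $q \neq 1$, and $D_{1,\phi}(f,g) \triangleq \lim_{q \to 1} D_{q,\phi}(f,g) = D(f,g)$.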

Define $|f| \triangleq \int f = \mu_f(\mathcal{X})$. For $|f| = |g| = 1$, several particular cases are recovered: if
$\phi(q) = 1 - 2^{1-q}$, then $D_{q,\phi}(f,g)$ is the Havrda-Charvát relative entropy (Havrda and Charvát, 1967; Daróczy,
1970); if $\phi(q) = q - 1$, then $D_{q,\phi}(f,g)$ is the Tsallis relative entropy (5); finally, if $\phi(q) = q(q-1)$,
then $D_{q,\phi}(f,g)$ is the canonical α-divergence defined by Amari and Nagaoka (2001) in the realm of
information geometry (with the reparameterization $\alpha = 2q - 1$ and assuming $q > 0$ so that
$\phi(q) = q(q-1)$ conforms with the axioms).


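Remark 1 Both functionals $S_{q,\phi}$ and $D_{q,\phi}$ are completely determined by their restriction to the
normalized measures. Indeed, the following equalities hold for any $c \in \mathbb{R}_{++}$ and $f, g \in M_+^{S_q}(\mathcal{X})$, with
$\mu_f \ll \mu_g$:

$$\begin{aligned} S_{q,\phi}(cf) &= c^q S_{q,\phi}(f) + |f|\, \varphi_q(c), \\ D_{q,\phi}(cf, cg) &= c\, D_{q,\phi}(f,g), \\ D_{q,\phi}(cf, g) &= c^q D_{q,\phi}(f,g) - q\, \varphi_q(c)\, |f| + \frac{k}{\phi(q)} (q-1)(1 - c^q)\, |g|. \end{aligned} \tag{13}$$

For any $f \in M_+^{S_q}(\mathcal{X})$ and $g \in M_+^{S_q}(\mathcal{Y})$,

$$S_{q,\phi}(f \otimes g) = |g|\, S_{q,\phi}(f) + |f|\, S_{q,\phi}(g) - \frac{\phi(q)}{k}\, S_{q,\phi}(f)\, S_{q,\phi}(g).$$

If $|f| = |g| = 1$, we recover the pseudo-additivity property of nonextensive entropies:

$$S_{q,\phi}(f \otimes g) = S_{q,\phi}(f) + S_{q,\phi}(g) - \frac{\phi(q)}{k}\, S_{q,\phi}(f)\, S_{q,\phi}(g).$$

For $\phi(q) = q - 1$, $D_{q,\phi}$ is the Tsallis relative entropy and (13) reduces to

$$D_q(cf, g) = c^q D_q(f,g) - q\, \varphi_q(c)\, |f| + k (1 - c^q)\, |g|.$$

By taking the limit $q \to 1$, we obtain the following formulae for $H$ and $D$:

$$\begin{aligned} H(cf) &= c\, H(f) + |f|\, \varphi_H(c), \\ D(cf, g) &= c\, D(f,g) - |f|\, \varphi_H(c) + k (1 - c)\, |g|. \end{aligned}$$

Consider $f \in M_+^H(\mathcal{X})$ and $g \in M_+^H(\mathcal{Y})$, and define $f \otimes g \in M_+^H(\mathcal{X} \times \mathcal{Y})$ as $(f \otimes g)(x,y) \triangleq f(x) g(y)$.
Then,

$$H(f \otimes g) = |g|\, H(f) + |f|\, H(g).$$

If $|f| = |g| = 1$, we recover the additivity property of the Shannon-Boltzmann-Gibbs entropy,
$H(f \otimes g) = H(f) + H(g)$.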
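The denormalization formulae can be sanity-checked on finite measures. The sketch below assumes the counting measure on a finite set, the Tsallis choice φ(q) = q − 1 with k = 1, and q ≠ 1; the vectors `f` and `g`, the scale `c`, and the function names are hypothetical.

```python
def phi_q(y, q, k=1.0):
    """Equation (11) for q != 1 with phi(q) = q - 1: (k/(q-1)) * (y - y**q)."""
    return k / (q - 1.0) * (y - y ** q)

def entropy_measure(f, q):
    """S_q of a (possibly unnormalized) nonnegative vector, as in Equation (10)."""
    return sum(phi_q(v, q) for v in f)

# Hypothetical unnormalized measures on finite sets, and a positive scale.
f = [0.5, 1.5, 2.0]
g = [0.3, 0.9]
c = 2.5
q = 1.4

mass_f, mass_g = sum(f), sum(g)
s_f, s_g = entropy_measure(f, q), entropy_measure(g, q)

# Scaling rule: S_q(c f) = c^q S_q(f) + |f| phi_q(c).
lhs_scale = entropy_measure([c * v for v in f], q)
rhs_scale = c ** q * s_f + mass_f * phi_q(c, q)

# Product rule: S_q(f x g) = |g| S_q(f) + |f| S_q(g) - (q - 1) S_q(f) S_q(g).
lhs_prod = entropy_measure([a * b for a in f for b in g], q)
rhs_prod = mass_g * s_f + mass_f * s_g - (q - 1.0) * s_f * s_g

print(lhs_scale, rhs_scale)
print(lhs_prod, rhs_prod)
```

Each printed pair agrees up to rounding, which is what Remark 1 predicts for this choice of φ.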


3. Jensen Differences and Divergences 


In this section, we review the concept of Jensen difference. We then discuss three particular cases: 
the Jensen-Shannon, Jensen-Rényi, and Jensen-Tsallis divergences. 


3.1 The Jensen Difference 


Jensen’s inequality (Jensen, 1906) is at the heart of many important results in information theory.
Let $E[\cdot]$ denote the expectation operator. Jensen’s inequality states that if $Z$ is an integrable random
variable taking values in a set $\mathcal{Z}$, and $f$ is a measurable convex function defined on the convex hull
of $\mathcal{Z}$, then

$$f(E[Z]) \le E[f(Z)].$$


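Burbea and Rao (1982) considered the scenario where $\mathcal{Z}$ is finite, and took $f \triangleq -H_\varphi$, where
$H_\varphi : [a,b]^n \to \mathbb{R}$ is a concave function, called a φ-entropy, defined as

$$H_\varphi(z) \triangleq -\sum_{i=1}^n \varphi(z_i), \tag{14}$$

where $\varphi : [a,b] \to \mathbb{R}$ is convex. They studied the Jensen difference

$$J_\varphi^\pi(y_1,\ldots,y_m) \triangleq H_\varphi\Big( \sum_{t=1}^m \pi_t y_t \Big) - \sum_{t=1}^m \pi_t H_\varphi(y_t),$$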




MARTINS, SMITH, XING, AGUIAR AND FIGUEIREDO 


where $\pi \triangleq (\pi_1,\ldots,\pi_m) \in \Delta^{m-1}$, and each $y_1,\ldots,y_m \in [a,b]^n$.
We consider here a more general scenario, involving two measure spaces $(\mathcal{X}, \mathcal{M}, \nu)$ and $(\mathcal{T}, \mathscr{T}, \tau)$,
where the second is used to index the first.


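Definition 2 Let $\mu \triangleq (\mu_t)_{t \in \mathcal{T}} \in [M_+(\mathcal{X})]^{\mathcal{T}}$ be a family of finite Radon measures on $\mathcal{X}$, indexed by
$\mathcal{T}$, and let $\omega \in M_+(\mathcal{T})$ be a finite Radon measure on $\mathcal{T}$. Define:

$$J_\Psi^\omega(\mu) \triangleq \Psi\Big( \int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \Big) - \int_{\mathcal{T}} \omega(t)\, \Psi(\mu_t)\, d\tau(t) \tag{15}$$

where:
(i) $\Psi$ is a concave functional such that $\mathrm{dom}\, \Psi \subseteq M_+(\mathcal{X})$;
(ii) $\omega(t)\mu_t(x)$ is τ-integrable, for all $x \in \mathcal{X}$;
(iii) $\int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \in \mathrm{dom}\, \Psi$;
(iv) $\mu_t \in \mathrm{dom}\, \Psi$, for all $t \in \mathcal{T}$;
(v) $\omega(t)\Psi(\mu_t)$ is τ-integrable.
If $\omega \in M_+^1(\mathcal{T})$, we still call (15) a Jensen difference.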


In the following subsections, we consider several instances of Definition 2, leading to several 
Jensen-type divergences. 


3.2 The Jensen-Shannon Divergence 


Let $p$ be a random probability distribution taking values in $\{p_t\}_{t \in \mathcal{T}}$ according to a distribution
$\pi \in M_+^1(\mathcal{T})$. (In classification/estimation theory parlance, $\pi$ is called the prior distribution and
$p_t \triangleq p(\cdot|t)$ the likelihood function.) Then, (15) becomes

$$J_\Psi^\pi(p) = \Psi(E[p]) - E[\Psi(p)], \tag{16}$$

where the expectations are with respect to $\pi$.

Let now $\Psi = H$, the Shannon-Boltzmann-Gibbs entropy. Consider the random variables $T$ and
$X$, taking values respectively in $\mathcal{T}$ and $\mathcal{X}$, with densities $\pi(t)$ and $p(x) \triangleq \int_{\mathcal{T}} p(x|t)\pi(t)$. Using
standard notation of information theory (Cover and Thomas, 1991),

$$\begin{aligned} J^\pi(p) \triangleq J_H^\pi(p) &= H\Big( \int_{\mathcal{T}} \pi(t)\, p_t \Big) - \int_{\mathcal{T}} \pi(t)\, H(p_t) \\ &= H(X) - \int_{\mathcal{T}} \pi(t)\, H(X|T=t) \\ &= H(X) - H(X|T) \\ &= I(X;T), \end{aligned} \tag{17}$$

where $I(X;T)$ is the mutual information between $X$ and $T$. (This relationship between JS divergence
and mutual information was pointed out by Grosse et al. 2002.) Since $I(X;T)$ is also equal to the
KL divergence between the joint distribution and the product of the marginals (Cover and Thomas,
1991), we have

$$J^\pi(p) = H(E[p]) - E[H(p)] = E[D(p \| E[p])]. \tag{18}$$

When $\mathcal{X}$ and $\mathcal{T}$ are finite with $|\mathcal{T}| = m$, $J_H^\pi(p_1,\ldots,p_m)$ is called the Jensen-Shannon (JS)
divergence of $p_1,\ldots,p_m$, with weights $\pi_1,\ldots,\pi_m$ (Burbea and Rao, 1982; Lin, 1991). Equality (18)
allows two interpretations of the JS divergence:


• the Jensen difference of the Shannon entropy of p;
• the expected KL divergence from p to the expectation of p.


A remarkable fact is that $J^\pi(p) = \min_r E[D(p \| r)]$, that is, $r^* = E[p]$ is a minimizer of $E[D(p \| r)]$
with respect to $r$. It has been shown that this property together with Equality (18) characterize the
so-called Bregman divergences: they hold not only for $\Psi = H$, but for any concave $\Psi$ and the
corresponding Bregman divergence, in which case $J_\Psi^\pi$ is the Bregman information (Banerjee et al.,
2005).

When $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, $p$ may be seen as a random distribution whose value on
$\{p_1, p_2\}$ is chosen by tossing a fair coin. In this case, $J^{(1/2,1/2)}(p) = JS(p_1, p_2)$, where

$$\begin{aligned} JS(p_1, p_2) &\triangleq H\Big( \frac{p_1 + p_2}{2} \Big) - \frac{H(p_1) + H(p_2)}{2} \\ &= \frac{1}{2} D\Big( p_1 \,\Big\|\, \frac{p_1 + p_2}{2} \Big) + \frac{1}{2} D\Big( p_2 \,\Big\|\, \frac{p_1 + p_2}{2} \Big), \end{aligned}$$

as introduced by Lin (1991). It has been shown that $\sqrt{JS}$ satisfies the triangle inequality (hence
being a metric) and that, moreover, it is a Hilbertian metric (see footnote 1) (Endres and Schindelin, 2003; Topsøe,
2000), which has motivated its use in kernel-based machine learning (Cuturi et al., 2005; Hein and
Bousquet, 2005) (see Section 6).
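The two readings of the JS divergence given by Equality (18) can be compared numerically. The following sketch works on finite sample spaces with k = 1; the example distributions, the weights, and the function names are illustrative assumptions.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats, with 0 ln 0 taken as 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def kl(p, r):
    """KL divergence D(p||r); assumes r > 0 wherever p > 0."""
    return sum(pv * math.log(pv / rv) for pv, rv in zip(p, r) if pv > 0.0)

def mixture(dists, weights):
    """Expectation E[p] = sum_t pi_t p_t."""
    n = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(n)]

def js_as_jensen_difference(dists, weights):
    """H(E[p]) - E[H(p)]."""
    m = mixture(dists, weights)
    return shannon_entropy(m) - sum(w * shannon_entropy(d) for w, d in zip(weights, dists))

def js_as_expected_kl(dists, weights):
    """E[D(p || E[p])]."""
    m = mixture(dists, weights)
    return sum(w * kl(d, m) for w, d in zip(weights, dists))

# Hypothetical distributions and weights.
dists = [[0.1, 0.6, 0.3], [0.5, 0.25, 0.25], [0.0, 0.2, 0.8]]
weights = [0.2, 0.5, 0.3]

print(js_as_jensen_difference(dists, weights), js_as_expected_kl(dists, weights))
```

With two point masses on different outcomes and equal weights, both functions return ln 2, the largest value the two-distribution JS divergence can take.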








3.3 The Jensen-Rényi Divergence 


Consider again the scenario above (Section 3.2), with the Rényi q-entropy

$$R_q(p) = \frac{1}{1-q} \ln \int p^q$$

replacing the Shannon-Boltzmann-Gibbs entropy. It is worth noting that the Rényi and Tsallis
q-entropies are monotonically related through $R_q(p) = \ln\big( [1 + (1-q)\, S_q(p)]^{\frac{1}{1-q}} \big)$, or, using the
q-logarithm function,

$$S_q(p) = \ln_q \exp R_q(p).$$

The Rényi q-entropy is concave for $q \in [0,1)$ and has the Shannon-Boltzmann-Gibbs entropy as
the limit when $q \to 1$. Letting $\Psi = R_q$, (16) becomes

$$J_{R_q}^\pi(p) = R_q(E[p]) - E[R_q(p)]. \tag{19}$$
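The monotone relation between the Rényi and Tsallis entropies stated above can be verified directly. This is a minimal sketch for a finite distribution with q ≠ 1 and k = 1; the distribution `p` and the helper names are hypothetical.

```python
import math

def ln_q(x, q):
    """q-logarithm for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q):
    """Tsallis entropy (k = 1), q != 1."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0.0)

def renyi_entropy(p, q):
    """Renyi entropy ln(sum_x p(x)^q) / (1 - q), q != 1."""
    return math.log(sum(v ** q for v in p if v > 0.0)) / (1.0 - q)

p = [0.1, 0.2, 0.7]  # hypothetical distribution
q = 0.4

s = tsallis_entropy(p, q)
r = renyi_entropy(p, q)

# S_q = ln_q(exp(R_q)) and R_q = ln([1 + (1 - q) S_q]^(1/(1-q))).
tsallis_from_renyi = ln_q(math.exp(r), q)
renyi_from_tsallis = math.log(1.0 + (1.0 - q) * s) / (1.0 - q)
print(s, tsallis_from_renyi)
print(r, renyi_from_tsallis)
```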





1. A metric $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is Hilbertian if there is some Hilbert space $\mathcal{H}$ and an isometry $f : \mathcal{X} \to \mathcal{H}$ such that
$d^2(x,y) = \langle f(x) - f(y), f(x) - f(y) \rangle_{\mathcal{H}}$ holds for any $x, y \in \mathcal{X}$ (Hein and Bousquet, 2005).

Unlike in the JS divergence case, there is no counterpart of equality (18) based on the Rényi
q-divergence

$$D_{R_q}(p_1 \| p_2) = \frac{1}{q-1} \ln \int p_1^q\, p_2^{1-q}.$$

When $\mathcal{X}$ and $\mathcal{T}$ are finite, we call $J_{R_q}^\pi$ in (19) the Jensen-Rényi (JR) divergence. Furthermore,
when $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, we write $J_{R_q}^\pi(p) = JR_q(p_1, p_2)$, where

$$JR_q(p_1, p_2) = R_q\Big( \frac{p_1 + p_2}{2} \Big) - \frac{R_q(p_1) + R_q(p_2)}{2}.$$

The JR divergence has been used in several signal/image processing applications, such as regis- 
tration, segmentation, denoising, and classification (Ben-Hamza and Krim, 2003; He et al., 2003; 
Karakos et al., 2007). In Section 6, we show that the JR divergence is (like the JS divergence) a 
Hilbertian metric, which is relevant for its use in kernel-based machine learning. 


3.4 The Jensen-Tsallis Divergence 


Burbea and Rao (1982) have defined Jensen-type divergences of the form (16) based on the Tsallis
q-entropy $S_q$, defined in (12). Like the Shannon-Boltzmann-Gibbs entropy, but unlike the Rényi
entropies, the Tsallis q-entropy, for finite $\mathcal{T}$, is an instance of a φ-entropy (see Equation 14). Letting
$\Psi = S_q$, (16) becomes

$$J_{S_q}^\pi(p) = S_q(E[p]) - E[S_q(p)]. \tag{20}$$

Again, as in Section 3.3, if we consider the Tsallis q-divergence,

$$D_q(p_1 \| p_2) = \frac{1}{1-q} \Big( 1 - \int p_1^q\, p_2^{1-q} \Big),$$

there is no counterpart of the Equality (18).

When $\mathcal{X}$ and $\mathcal{T}$ are finite, $J_{S_q}^\pi$ in (20) is called the Jensen-Tsallis (JT) divergence and it has also
been applied in image processing (Ben-Hamza, 2006). Unlike the JS divergence, the JT divergence
lacks an interpretation as a mutual information. Despite this, for $q \in [1,2]$, it exhibits joint convexity
(Burbea and Rao, 1982). In the next section, we propose an alternative to the JT divergence which,
among other features, is interpretable as a nonextensive mutual information (in the sense of Furuichi
2006) and is jointly convex, for $q \in [0,1]$.


4. q-Convexity and q-Differences


This section introduces a novel class of functions, termed Jensen q-differences, which generalize 
Jensen differences. Later (in Section 5), we will use these functions to define the Jensen-Tsallis q- 
difference, which we will propose as an alternative nonextensive generalization of the JS divergence, 
instead of the JT divergence discussed in Section 3.4. We begin by recalling the concept of q- 
expectation (Tsallis, 1988). 


Definition 3 The unnormalized q-expectation of a random variable $X$, with probability density $p$,
is

$$E_q[X] \triangleq \int x\, p(x)^q.$$

Of course, $q = 1$ corresponds to the standard notion of expectation. For $q \neq 1$, the q-expectation
does not match the intuitive meaning of average/expectation (e.g., $E_q[1] \neq 1$, in general). The
q-expectation is a convenient concept in nonextensive information theory; for example, it yields a very
compact form for the Tsallis entropy: $S_q(X) = -E_q[\ln_q p(X)]$.


4.1 q-Convexity


We now introduce the novel concept of q-convexity and use it to derive a set of results, namely the
Jensen q-inequality.


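Definition 4 Let $q \in \mathbb{R}$ and $\mathcal{X}$ be a convex set. A function $f : \mathcal{X} \to \mathbb{R}$ is q-convex if for any $x, y \in \mathcal{X}$
and $\lambda \in [0,1]$,

$$f(\lambda x + (1-\lambda) y) \le \lambda^q f(x) + (1-\lambda)^q f(y). \tag{21}$$

If $-f$ is q-convex, $f$ is said to be q-concave.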


Of course, 1-convexity is the usual notion of convexity. Many properties of 1-convex functions
do not have q-analogues. For example, for $q \neq 1$, any q-convex function must be either nonnegative
(if $q < 1$) or nonpositive (if $q > 1$); this simple fact can be shown through reductio ad absurdum
by setting $x = y$ in (21), which gives $f(x) \le (\lambda^q + (1-\lambda)^q) f(x)$, where the factor $\lambda^q + (1-\lambda)^q$
exceeds 1 for $q < 1$ and is below 1 for $q > 1$ whenever $\lambda \in (0,1)$. However, other properties remain: the next
proposition states the Jensen q-inequality.


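Proposition 5 If $f : \mathcal{X} \to \mathbb{R}$ is q-convex, then for any $n \in \mathbb{N}$, $x_1,\ldots,x_n \in \mathcal{X}$ and
$\pi = (\pi_1,\ldots,\pi_n) \in \Delta^{n-1}$,

$$f\Big( \sum_{i=1}^n \pi_i x_i \Big) \le \sum_{i=1}^n \pi_i^q f(x_i).$$

Moreover, if $f$ is continuous, the above still holds for countably many points $(x_i)_{i \in \mathbb{N}}$.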


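Proof In the finite case, the proof can be carried out by induction, as in the proof of the standard
Jensen inequality (Cover and Thomas, 1991). Assuming that the inequality holds for $n \in \mathbb{N}$, then,
from the definition of q-convexity, it will also hold for $n + 1$:

$$\begin{aligned} f\Big( \sum_{i=1}^{n+1} \pi_i x_i \Big) &= f\Big( \sum_{i=1}^{n} \pi_i x_i + \pi_{n+1} x_{n+1} \Big) \\ &= f\Big( (1 - \pi_{n+1}) \sum_{i=1}^{n} \pi'_i x_i + \pi_{n+1} x_{n+1} \Big) \\ &\le (1 - \pi_{n+1})^q f\Big( \sum_{i=1}^{n} \pi'_i x_i \Big) + \pi_{n+1}^q f(x_{n+1}) \\ &\le (1 - \pi_{n+1})^q \sum_{i=1}^{n} (\pi'_i)^q f(x_i) + \pi_{n+1}^q f(x_{n+1}) = \sum_{i=1}^{n+1} \pi_i^q f(x_i), \end{aligned}$$

where we used the fact that $\pi_{n+1} = 1 - \sum_{i=1}^n \pi_i$, and we defined $\pi'_i \triangleq \pi_i / (1 - \pi_{n+1})$ (note that
$\sum_{i=1}^n \pi'_i = 1$). Furthermore, if $f$ is continuous, it commutes with taking limits, thus

$$f\Big( \sum_{i=1}^{\infty} \pi_i x_i \Big) = f\Big( \lim_{n \to \infty} \sum_{i=1}^{n} \pi_i x_i \Big) = \lim_{n \to \infty} f\Big( \sum_{i=1}^{n} \pi_i x_i \Big) \le \lim_{n \to \infty} \sum_{i=1}^{n} \pi_i^q f(x_i) = \sum_{i=1}^{\infty} \pi_i^q f(x_i).$$

□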

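Proposition 6 Let $f \ge 0$ and $q \ge r \ge 0$; then,

$$f \text{ is q-convex} \;\Rightarrow\; f \text{ is r-convex} \tag{22}$$
$$f \text{ is r-concave} \;\Rightarrow\; f \text{ is q-concave}. \tag{23}$$

Proof Implication (22) results from

$$f(\lambda x + (1-\lambda) y) \le \lambda^q f(x) + (1-\lambda)^q f(y) \le \lambda^r f(x) + (1-\lambda)^r f(y),$$

where the first inequality states the q-convexity of $f$ and the second one is valid because $f(x), f(y) \ge 0$
and $t^r \ge t^q \ge 0$, for any $t \in [0,1]$ and $q \ge r$. The proof of (23) is similar. □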
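A grid-based check can make inequality (21) and the Jensen q-inequality more tangible. The sketch below uses f(x) = x², which is nonnegative and convex, hence q-convex for q ∈ [0,1] by Proposition 6 (taking the usual convexity as the q = 1 case); it is not q-convex for q > 1, since a q-convex function must then be nonpositive. The grid, the test function, and the helper names are illustrative, and a finite grid can only detect violations, not prove q-convexity.

```python
def q_convexity_violated(f, q, points, lambdas, tol=1e-12):
    """Return True if inequality (21) fails for some grid pair (x, y) and some lambda."""
    for x in points:
        for y in points:
            for lam in lambdas:
                lhs = f(lam * x + (1.0 - lam) * y)
                rhs = lam ** q * f(x) + (1.0 - lam) ** q * f(y)
                if lhs > rhs + tol:
                    return True
    return False

def jensen_q_gap(f, q, xs, weights):
    """Right-hand side minus left-hand side of the Jensen q-inequality."""
    lhs = f(sum(w * x for w, x in zip(weights, xs)))
    rhs = sum(w ** q * f(x) for w, x in zip(weights, xs))
    return rhs - lhs

def square(x):
    return x * x

points = [i / 10.0 for i in range(-20, 21)]
lambdas = [i / 20.0 for i in range(21)]

print(q_convexity_violated(square, 0.5, points, lambdas))  # no violation expected
print(q_convexity_violated(square, 1.5, points, lambdas))  # violation expected (take x = y)
print(jensen_q_gap(square, 0.5, [-1.0, 0.5, 2.0], [0.2, 0.3, 0.5]))
```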


4.2 Jensen q-Differences 


We now generalize Jensen differences, formalized in Definition 2, by introducing the concept of 
Jensen q-differences. 


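Definition 7 Let $\mu \triangleq (\mu_t)_{t \in \mathcal{T}} \in [M_+(\mathcal{X})]^{\mathcal{T}}$ be a family of finite Radon measures on $\mathcal{X}$, indexed by
$\mathcal{T}$, and let $\omega \in M_+(\mathcal{T})$ be a finite Radon measure on $\mathcal{T}$. For $q \ge 0$, define

$$T_{q,\Psi}^\omega(\mu) \triangleq \Psi\Big( \int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \Big) - \int_{\mathcal{T}} \omega(t)^q\, \Psi(\mu_t)\, d\tau(t) \tag{24}$$

where:
(i) $\Psi$ is a concave functional such that $\mathrm{dom}\, \Psi \subseteq M_+(\mathcal{X})$;
(ii) $\omega(t)\mu_t(x)$ is τ-integrable for all $x \in \mathcal{X}$;
(iii) $\int_{\mathcal{T}} \omega(t)\, \mu_t \, d\tau(t) \in \mathrm{dom}\, \Psi$;
(iv) $\mu_t \in \mathrm{dom}\, \Psi$, for all $t \in \mathcal{T}$;
(v) $\omega(t)^q \Psi(\mu_t)$ is τ-integrable.
If $\omega \in M_+^1(\mathcal{T})$, we call the function defined in (24) a Jensen q-difference.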


Burbea and Rao (1982) established necessary and sufficient conditions on $\varphi$ for the Jensen
difference of a φ-entropy (see Equation 14) to be convex. The following proposition generalizes
that result, extending it to Jensen q-differences.


Proposition 8 Let $\mathcal{T}$ and $\mathcal{X}$ be finite sets, with $|\mathcal{T}| = m$ and $|\mathcal{X}| = n$, and let $\pi \in M_+^1(\mathcal{T})$. Let
$\varphi : [0,1] \to \mathbb{R}$ be a function of class $C^2$ and consider the (φ-entropy, Burbea and Rao, 1982) function
$\Psi : [0,1]^n \to \mathbb{R}$ defined as $\Psi(z) \triangleq -\sum_{i=1}^n \varphi(z_i)$. Then, the q-difference $T_{q,\Psi}^\pi : [0,1]^{nm} \to \mathbb{R}$ is convex
if and only if $\varphi$ is convex and $-1/\varphi''$ is $(2-q)$-convex.


The proof is rather long, thus it is relegated to Appendix B.

5. The Jensen-Tsallis q-Difference


This section introduces the Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen- 
Shannon divergence. After deriving some properties concerning the convexity and extrema of these 
functionals, we introduce the notion of joint and conditional Jensen-Tsallis q-difference, a contrast 
measure between stochastic processes. We end the section with a brief asymptotic analysis for the 
extensive case. 


5.1 Definition 


As in Section 3.2, let $p$ be a random probability distribution taking values in $\{p_t\}_{t \in \mathcal{T}}$ according to a
distribution $\pi \in M_+^1(\mathcal{T})$. Then, we may write

$$T_{q,\Psi}^\pi(p) = \Psi(E[p]) - E_q[\Psi(p)],$$

where the expectations are with respect to $\pi$. Hence Jensen q-differences may be seen as
deformations of the standard Jensen differences (16), in which the second expectation is replaced by a
q-expectation.

Let $\Psi = S_q$, the nonextensive Tsallis q-entropy. Introducing the random variables $T$ and $X$, with
values respectively in $\mathcal{T}$ and $\mathcal{X}$, with densities $\pi(t)$ and $p(x) \triangleq \int_{\mathcal{T}} p(x|t)\pi(t)$, we have (writing $T_{q,S_q}^\pi$
simply as $T_q^\pi$)

$$\begin{aligned} T_q^\pi(p) &= S_q(E[p]) - E_q[S_q(p)] \\ &= S_q(X) - \int_{\mathcal{T}} \pi(t)^q\, S_q(X|T=t) \\ &= S_q(X) - S_q(X|T) \\ &= I_q(X;T), \end{aligned} \tag{25}$$

where $S_q(X|T)$ is the Tsallis conditional entropy (4), and $I_q(X;T)$ is the Tsallis mutual information
(6), as defined by Furuichi (2006). Observe that (25) is a nonextensive analogue of (17). Since, in
general, $I_q \neq \tilde{I}_q$ (see Equation 7), unless $q = 1$ (in that case, $I_1 = \tilde{I}_1 = I$), there is no counterpart
of (18) in terms of q-differences. Nevertheless, Lamberti and Majtey (2003) have proposed a
non-logarithmic version of the JS divergence, which corresponds to using $\tilde{I}_q$ for the Tsallis mutual
q-entropy (although this interpretation is not explicitly mentioned).
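Equation (25) can be checked on a finite example by computing the Jensen-Tsallis q-difference directly and comparing it with the Tsallis mutual entropy of the pair (X, T). In this sketch the conditional entropy is obtained from the chain rule applied to the joint distribution π(t) p_t(x); k = 1 is assumed, and the distributions, weights, and function names are hypothetical.

```python
import math

def ln_q(x, q):
    """q-logarithm; natural log at q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def entropy(p, q):
    """Tsallis entropy (k = 1)."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0.0)

def mixture(dists, weights):
    n = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(n)]

def jt_q_difference(dists, weights, q):
    """S_q(E[p]) - E_q[S_q(p)], with E_q the unnormalized q-expectation."""
    return entropy(mixture(dists, weights), q) - sum(
        w ** q * entropy(d, q) for w, d in zip(weights, dists))

def mutual_entropy_x_t(dists, weights, q):
    """I_q(X;T) = S_q(X) - S_q(X|T), where S_q(X|T) = S_q(X,T) - S_q(T)."""
    joint = [w * v for w, d in zip(weights, dists) for v in d]
    cond = entropy(joint, q) - entropy(weights, q)
    return entropy(mixture(dists, weights), q) - cond

dists = [[0.1, 0.6, 0.3], [0.5, 0.25, 0.25], [0.2, 0.2, 0.6]]
weights = [0.2, 0.5, 0.3]

for q in (0.5, 1.0, 2.0):
    print(q, jt_q_difference(dists, weights, q), mutual_entropy_x_t(dists, weights, q))
```

For each value of q the two columns agree up to rounding.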

When $\mathcal{X}$ and $\mathcal{T}$ are finite with $|\mathcal{T}| = m$, we call the quantity $T_q^\pi(p_1,\ldots,p_m)$ the Jensen-Tsallis
(JT) q-difference of $p_1,\ldots,p_m$ with weights $\pi_1,\ldots,\pi_m$. Although the JT q-difference is a
generalization of the JS divergence, for $q \neq 1$, the term “divergence” would be misleading in this case,
since $T_q^\pi$ may take negative values (if $q < 1$) and does not vanish in general if $p$ is deterministic.

When $|\mathcal{T}| = 2$ and $\pi = (1/2, 1/2)$, define $T_q \triangleq T_q^{(1/2,1/2)}$,

$$T_q(p_1, p_2) = S_q\Big( \frac{p_1 + p_2}{2} \Big) - \frac{S_q(p_1) + S_q(p_2)}{2^q}.$$
Notable cases arise for particular values of q: 


• For $q = 0$, $S_0(p) = -1 + \nu(\mathrm{supp}(p))$, where $\nu(\mathrm{supp}(p))$ denotes the measure of the support
of $p$ (recall that $p$ is defined on the measure space $(\mathcal{X}, \mathcal{M}, \nu)$). For example, if $\mathcal{X}$ is finite and
$\nu$ is the counting measure, $\nu(\mathrm{supp}(p)) = \|p\|_0$ is the so-called 0-norm (although it is not a
norm) of vector $p$, that is, its number of nonzero components. The Jensen-Tsallis 0-difference
is thus

$$\begin{aligned} T_0(p_1, p_2) &= -1 + \nu\Big( \mathrm{supp}\Big( \frac{p_1 + p_2}{2} \Big) \Big) + 1 - \nu(\mathrm{supp}(p_1)) + 1 - \nu(\mathrm{supp}(p_2)) \\ &= 1 + \nu(\mathrm{supp}(p_1) \cup \mathrm{supp}(p_2)) - \nu(\mathrm{supp}(p_1)) - \nu(\mathrm{supp}(p_2)) \\ &= 1 - \nu(\mathrm{supp}(p_1) \cap \mathrm{supp}(p_2)); \end{aligned} \tag{26}$$

if $\mathcal{X}$ is finite and $\nu$ is the counting measure, this becomes

$$T_0(p_1, p_2) = 1 - \|p_1 \odot p_2\|_0,$$

where $\odot$ denotes the Hadamard-Schur (i.e., elementwise) product. We call $T_0$ the Boolean
difference.


• For $q = 1$, since $S_1(p) = H(p)$, $T_1$ is the JS divergence,

$$T_1(p_1, p_2) = JS(p_1, p_2).$$


• For $q = 2$, $S_2(p) = 1 - \langle p, p \rangle$, where $\langle a, b \rangle = \int a(x)\, b(x)\, d\nu(x)$ is the inner product between
$a$ and $b$ (which reduces to $\langle a, b \rangle = \sum_i a_i b_i$ if $\mathcal{X}$ is finite and $\nu$ is the counting measure).
Consequently, the Tsallis 2-difference is

$$T_2(p_1, p_2) = \frac{1}{2} - \frac{1}{2} \langle p_1, p_2 \rangle,$$

which we call the linear difference.
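The three special cases above can be reproduced with a direct implementation of the equal-weight JT q-difference on a finite sample space. The sketch below assumes the counting measure and k = 1; the distributions `p1`, `p2` and the function names are illustrative.

```python
import math

def ln_q(x, q):
    """q-logarithm; natural log at q = 1."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def entropy(p, q):
    """Tsallis entropy (k = 1); sums over the support only, so q = 0 is handled."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0.0)

def jt(p1, p2, q):
    """Equal-weight JT q-difference: S_q((p1+p2)/2) - (S_q(p1) + S_q(p2)) / 2**q."""
    mix = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    return entropy(mix, q) - (entropy(p1, q) + entropy(p2, q)) / 2.0 ** q

def kl(p, r):
    return sum(pv * math.log(pv / rv) for pv, rv in zip(p, r) if pv > 0.0)

# Hypothetical distributions whose supports overlap on two outcomes.
p1 = [0.5, 0.3, 0.2, 0.0]
p2 = [0.0, 0.25, 0.25, 0.5]
mix = [(a + b) / 2.0 for a, b in zip(p1, p2)]

boolean = 1 - sum(1 for a, b in zip(p1, p2) if a * b > 0.0)   # 1 - ||p1 . p2||_0
js = 0.5 * kl(p1, mix) + 0.5 * kl(p2, mix)                    # JS divergence
linear = 0.5 - 0.5 * sum(a * b for a, b in zip(p1, p2))       # 1/2 - <p1, p2>/2

print(jt(p1, p2, 0.0), boolean)
print(jt(p1, p2, 1.0), js)
print(jt(p1, p2, 2.0), linear)
```

Note that the Boolean difference is negative for this pair, consistent with the earlier remark that the JT q-difference may take negative values when q < 1.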


5.2 Properties of the JT q-Difference 


This subsection presents results regarding convexity and extrema of the JT $q$-difference, for certain values of $q$, extending known properties of the JS divergence ($q = 1$). Some properties of the JS divergence are lost in the transition to nonextensivity; for example, while the JS divergence is nonnegative and vanishes if and only if all the distributions are identical, this is not true in general of the JT $q$-difference. Nonnegativity of the JT $q$-difference is only guaranteed if $q \ge 1$, which explains why some authors (e.g., Furuichi 2006) only consider values of $q \ge 1$ when looking for nonextensive analogues of Shannon's information theory. Moreover, unless $q = 1$, it is not generally true that $T_q^\pi(p,\ldots,p) = 0$ or even that $T_q^\pi(p,\ldots,p,p') \ge T_q^\pi(p,\ldots,p,p)$. For example, the solution of the optimization problem

$$\min_{p_1 \in M_+^1(X)} T_q(p_1,p_2) \qquad (27)$$

is, in general, different from $p_2$, unless $q = 1$. Instead, this minimizer is closer to the uniform distribution if $q \in [0,1)$, and closer to a degenerate distribution for $q \in (1,2]$ (see Fig. 1). This is not so surprising: recall that $T_2(p_1,p_2) = \frac{1}{2} - \frac{1}{2}\langle p_1,p_2\rangle$; in this case, (27) becomes a linear program, and the solution is not $p_1^* = p_2$, but $p_1^* = \delta_j$, where $j = \arg\max_i p_{2i}$.
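The behavior of the minimizer of (27) can be illustrated for Bernoulli distributions with a plain grid search. This is a sketch under stated assumptions: the helper `jt_minimizer`, the grid resolution, and the fixed distribution $(0.3, 0.7)$ (taken from Fig. 1) are illustrative choices, not part of the original analysis.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (zero entries are skipped)."""
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in support)
    return (1.0 - sum(x ** q for x in support)) / (q - 1.0)

def jt_difference(dists, weights, q):
    """JT q-difference: S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t)."""
    n = len(dists[0])
    mix = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(n)]
    return tsallis_entropy(mix, q) - sum(
        (w ** q) * tsallis_entropy(p, q) for w, p in zip(weights, dists))

def jt_minimizer(q, p_fixed=0.3, grid=1000):
    """Grid search for the p minimizing T_q((p, 1-p), (p_fixed, 1-p_fixed))."""
    fixed = [p_fixed, 1.0 - p_fixed]

    def objective(i):
        p = i / grid
        return jt_difference([[p, 1.0 - p], fixed], [0.5, 0.5], q)

    return min(range(grid + 1), key=objective) / grid

below_one = jt_minimizer(0.5)   # pulled from 0.3 toward the uniform value 0.5
at_one = jt_minimizer(1.0)      # the JS divergence is minimized at p = 0.3
at_two = jt_minimizer(2.0)      # linear program: minimizer is degenerate
```

For $q = 2$ the objective is linear in $p$, so the search lands on the vertex $p = 0$, that is, on $\delta_j$ with $j$ the larger component of the fixed distribution.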

At this point, we should also remark that, when $X$ is a finite set, the uniform distribution maximizes the Tsallis entropy for any $q \ge 0$, which is in fact one of the Suyari axioms underlying the Tsallis entropy (see Axiom A2 in Appendix A).
[Plot: Jensen-Tsallis $q$-difference to a fixed Bernoulli distribution ($p_1 = 0.3$)]

Figure 1: Jensen-Tsallis $q$-difference between two Bernoulli distributions, $p_1 = (0.3, 0.7)$ and $p_2 = (p, 1-p)$, for several values of the entropic index $q$. Observe that, for $q \in [0,1)$, the minimizer of the JT $q$-difference approaches the uniform distribution $(0.5, 0.5)$ as $q$ approaches 0; for $q \in (1,2]$, this minimizer approaches the degenerate distribution as $q \to 2$.


We start with the following corollary of Proposition 8, which establishes the joint convexity of the JT $q$-difference, for $q \in [0,1]$. (Interestingly, this "complements" the joint convexity of the JT divergence (20), for $q \in [1,2]$, proved by Burbea and Rao 1982.)

Corollary 9 Let $T$ and $X$ be finite sets with cardinalities $m$ and $n$, respectively. For $q \in [0,1]$, the JT $q$-difference is a jointly convex function on $[M_+^1(X)]^m$. Formally, let $\{p_t^{(i)}\}_{t \in T}$, for $i = 1,\ldots,l$, be a collection of $l$ sets of probability distributions on $X$; then, for any $(\lambda_1,\ldots,\lambda_l) \in \Delta^{l-1}$,

$$T_q^\pi\!\left(\sum_{i=1}^l \lambda_i p_1^{(i)}, \ldots, \sum_{i=1}^l \lambda_i p_m^{(i)}\right) \le \sum_{i=1}^l \lambda_i\, T_q^\pi\!\left(p_1^{(i)}, \ldots, p_m^{(i)}\right).$$
Proof Observe that the Tsallis entropy (3) of a probability distribution $p_t = \{p_{t1},\ldots,p_{tn}\}$ can be written as

$$S_q(p_t) = -\sum_{i=1}^n \varphi_q(p_{ti}), \quad \text{where } \varphi_q(x) = \frac{x - x^q}{1-q};$$

thus, from Proposition 8, $T_q^\pi$ is convex if and only if $\varphi_q$ is convex and $-1/\varphi_q''$ is $(2-q)$-convex. Since $\varphi_q''(x) = q x^{q-2}$, $\varphi_q$ is convex for $x \ge 0$ and $q \ge 0$. To show the $(2-q)$-convexity of $-1/\varphi_q''(x) = -(1/q)x^{2-q}$, for $x_i \ge 0$ and $q \in [0,1]$, we use a version of the power mean inequality (Steele, 2006),

$$-\left(\sum_{i=1}^l \lambda_i x_i\right)^{2-q} \le -\sum_{i=1}^l (\lambda_i x_i)^{2-q} = -\sum_{i=1}^l \lambda_i^{2-q} x_i^{2-q},$$

thus concluding that $-1/\varphi_q''$ is in fact $(2-q)$-convex. □


A consequence of Corollary 9 is that, for finite $X$ and any $q \in [0,1]$, the JT $q$-difference is upper bounded, namely $T_q^\pi(p_1,\ldots,p_m) \le S_q(\pi)$. Indeed, since $T_q^\pi$ is convex and its domain is the Cartesian product of $m$ simplices (a convex polytope), its maximum must occur on a vertex, that is, when each argument $p_t$ is a degenerate distribution at $x_t$, denoted $\delta_{x_t}$. In particular, if $|X| \ge |T|$, this maximum occurs at a vertex corresponding to disjoint degenerate distributions, that is, such that $x_i \ne x_j$ if $i \ne j$. At this maximum,

$$\begin{aligned}
T_q^\pi(\delta_{x_1},\ldots,\delta_{x_m}) &= S_q\!\left(\sum_{t=1}^m \pi_t \delta_{x_t}\right) - \sum_{t=1}^m \pi_t^q S_q(\delta_{x_t}) \\
&= S_q\!\left(\sum_{t=1}^m \pi_t \delta_{x_t}\right) \qquad (28) \\
&= S_q(\pi),
\end{aligned}$$

where the equality in (28) results from $S_q(\delta_{x_t}) = 0$. (Notice that this maximum may not be achieved if $|X| < |T|$.) The next proposition provides a stronger result: it establishes upper and lower bounds for the JT $q$-difference that hold for any non-negative $q$ and for countable $X$ and $T$.


Proposition 10 Let $T$ and $X$ be countable sets. For $q \ge 0$,

$$T_q^\pi((p_t)_{t \in T}) \le S_q(\pi), \qquad (29)$$

and, if $|X| \ge |T|$, the maximum of $T_q^\pi$ is reached for a set of disjoint degenerate distributions. This maximum may not be attained if $|X| < |T|$.
For $q \ge 1$,

$$T_q^\pi((p_t)_{t \in T}) \ge 0,$$

and the minimum of $T_q^\pi$ is attained in the purely deterministic case, that is, when all distributions are equal to the same degenerate distribution.
For $q \in [0,1]$ and $X$ a finite set with $|X| = n$,

$$T_q^\pi((p_t)_{t \in T}) \ge S_q(\pi)\left[1 - n^{1-q}\right]. \qquad (30)$$

This lower bound (which is zero or negative) is attained when all distributions are uniform.

Proof The proof is given in Appendix C. □
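The bounds (29) and (30) lend themselves to a quick numerical sanity check on random finite distributions. The sketch below is only an illustration (it does not replace the proof in Appendix C); the helper names, the sizes $m = 3$ and $n = 4$, the random seed and the tolerance are all assumptions made for the example.

```python
import math
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (zero entries are skipped)."""
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in support)
    return (1.0 - sum(x ** q for x in support)) / (q - 1.0)

def jt_difference(dists, weights, q):
    """JT q-difference: S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t)."""
    n = len(dists[0])
    mix = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(n)]
    return tsallis_entropy(mix, q) - sum(
        (w ** q) * tsallis_entropy(p, q) for w, p in zip(weights, dists))

def random_distribution(n, rng):
    """Strictly positive random point of the probability simplex."""
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

def bounds_hold(q, m=3, n=4, trials=200, seed=0, tol=1e-9):
    """Check lower <= T_q^pi <= S_q(pi) on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        weights = random_distribution(m, rng)
        dists = [random_distribution(n, rng) for _ in range(m)]
        value = jt_difference(dists, weights, q)
        upper = tsallis_entropy(weights, q)
        # Lower bound: 0 for q >= 1, S_q(pi) * (1 - n^(1-q)) for q in [0, 1).
        lower = 0.0 if q >= 1 else upper * (1.0 - n ** (1.0 - q))
        if not (lower - tol <= value <= upper + tol):
            return False
    return True

def uniform_case(q, n=4):
    """All distributions uniform: returns (T_q^pi, claimed lower bound)."""
    weights = [0.2, 0.3, 0.5]
    uniform = [1.0 / n] * n
    value = jt_difference([uniform] * 3, weights, q)
    return value, tsallis_entropy(weights, q) * (1.0 - n ** (1.0 - q))
```

Calling `uniform_case(0.5)` returns two equal numbers, which matches the statement that the lower bound (30) is attained when all distributions are uniform.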


Finally, the next proposition characterizes the convexity/concavity of the JT $q$-difference on each argument.

Proposition 11 Let $T$ and $X$ be countable sets. The JT $q$-difference is convex in each argument, for $q \in [0,2]$, and concave in each argument, for $q \ge 2$.

Proof Notice that the JT $q$-difference can be written as $T_q^\pi(p_1,\ldots,p_m) = \sum_j \psi(p_{1j},\ldots,p_{mj})$, with

$$\psi(y_1,\ldots,y_m) = \frac{1}{q-1}\left[\sum_i (\pi_i - \pi_i^q)\,y_i + \sum_i \pi_i^q y_i^q - \left(\sum_i \pi_i y_i\right)^q\right].$$

It suffices to consider the second derivative of $\psi$ with respect to $y_1$. Introducing $z = \sum_{i=2}^m \pi_i y_i$,

$$\begin{aligned}
\frac{\partial^2 \psi}{\partial y_1^2} &= q\left[\pi_1^q y_1^{q-2} - \pi_1^2 (\pi_1 y_1 + z)^{q-2}\right] \\
&= q\,\pi_1^2\left[(\pi_1 y_1)^{q-2} - (\pi_1 y_1 + z)^{q-2}\right]. \qquad (31)
\end{aligned}$$

Since $\pi_1 y_1 \le (\pi_1 y_1 + z) \le 1$, the quantity in (31) is nonnegative for $q \in [0,2]$ and non-positive for $q \ge 2$. □


5.3 Joint and Conditional JT q-Differences and a Chain Rule


This subsection introduces joint and conditional JT $q$-differences, which will later be used as a contrast measure between stochastic processes. A chain rule is derived that relates conditional and joint JT $q$-differences.


Definition 12 Let $X$, $Y$ and $T$ be measure spaces. Let $(p_t)_{t \in T} \in [M_+^1(X \times Y)]^T$ be a family of measures in $M_+^1(X \times Y)$ indexed by $T$, and let $p$ be a random probability distribution taking values in $\{p_t\}_{t \in T}$ according to a distribution $\pi \in M_+^1(T)$. Consider also:

• for each $t \in T$, the marginals $p_t(Y) \in M_+^1(Y)$,

• for each $t \in T$ and $y \in Y$, the conditionals $p_t(X|Y=y) \in M_+^1(X)$,

• the mixture $r(X,Y) = \int_T \pi(t)\,p_t(X,Y) \in M_+^1(X \times Y)$,

• the marginal $r(Y) \in M_+^1(Y)$,

• for each $y \in Y$, the conditionals $r(X|Y=y) \in M_+^1(X)$.

For notational convenience, we also append a subscript to $p$ to emphasize its joint or conditional dependency on the random variables $X$ and $Y$, that is, $p_{XY} \triangleq p$, and $p_{X|Y}$ denotes a random conditional probability distribution taking values in $\{p_t(\cdot|Y)\}_{t \in T}$ according to the distribution $\pi$.

For $q \ge 0$, we refer to the joint JT $q$-difference of $p_{XY}$ by

$$T_q^\pi(p_{XY}) \triangleq T_q^\pi(p) = S_q(r) - E_{q,t\sim\pi(T)}[S_q(p_t)]$$

and to the conditional JT $q$-difference of $p_{XY}$ by

$$T_q^\pi(p_{X|Y}) \triangleq E_{q,y\sim r(Y)}\big[S_q(r(\cdot|Y=y))\big] - E_{q,t\sim\pi(T)}\Big[E_{q,y\sim p_t(Y)}\big[S_q(p_t(\cdot|Y=y))\big]\Big], \qquad (32)$$

where we appended the random variables being used in each $q$-expectation, for the sake of clarity.
Note that the joint JT $q$-difference is just the usual JT $q$-difference of the joint random variable $X \times Y$, which equals (cf. 25)

$$T_q^\pi(p_{XY}) = S_q(X,Y) - S_q(X,Y|T) = I_q(X \times Y; T), \qquad (33)$$

and the conditional JT $q$-difference is simply the usual JT $q$-difference with all entropies replaced by conditional entropies (conditioned on $Y$). Indeed, expression (32) can be rewritten as:

$$T_q^\pi(p_{X|Y}) = S_q(X|Y) - S_q(X|T,Y) = I_q(X;T|Y), \qquad (34)$$

that is, the conditional JT $q$-difference may also be interpreted as a Tsallis mutual information, as in (25), but now conditioned on the random variable $Y$.
Note also that, for the extensive case $q = 1$, (32) may also be rewritten in terms of the conditional KL divergences,

$$\begin{aligned}
J^\pi(p_{X|Y}) \triangleq T_1^\pi(p_{X|Y}) &= E_{y\sim r(Y)}\big[H(r(\cdot|Y=y))\big] - E_{t\sim\pi(T)}\Big[E_{y\sim p_t(Y)}\big[H(p_t(\cdot|Y=y))\big]\Big] \\
&= E_{t\sim\pi(T)}\Big[E_{y\sim p_t(Y)}\big[D\big(p_t(\cdot|Y=y)\,\|\,r(\cdot|Y=y)\big)\big]\Big].
\end{aligned}$$


Proposition 13 The following chain rule holds:

$$T_q^\pi(p_{XY}) = T_q^\pi(p_{X|Y}) + T_q^\pi(p_Y).$$

Proof Writing the joint/conditional JT $q$-differences as joint/conditional mutual informations (33-34) and invoking the chain rule provided by (4), we have that

$$\begin{aligned}
I_q(X;T|Y) + I_q(Y;T) &= S_q(X|Y) - S_q(X|T,Y) + S_q(Y) - S_q(Y|T) \\
&= S_q(X,Y) - S_q(X,Y|T),
\end{aligned}$$

which is the joint JT $q$-difference associated with the random variable $X \times Y$. □
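The chain rule can be verified numerically on small joint tables. The sketch below assumes finite $X$, $Y$ and $T$, represents each $p_t$ as a nested list `table[x][y]`, and computes every $q$-expectation as a sum weighted by the $q$-th power of the probabilities; all function names are illustrative assumptions.

```python
import math
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (zero entries are skipped)."""
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in support)
    return (1.0 - sum(x ** q for x in support)) / (q - 1.0)

def jt_difference(dists, weights, q):
    """JT q-difference: S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t)."""
    n = len(dists[0])
    mix = [sum(w * p[i] for w, p in zip(weights, dists)) for i in range(n)]
    return tsallis_entropy(mix, q) - sum(
        (w ** q) * tsallis_entropy(p, q) for w, p in zip(weights, dists))

def marginal_y(table):
    """Marginal over Y of a joint table indexed as table[x][y]."""
    return [sum(row[y] for row in table) for y in range(len(table[0]))]

def conditional_entropy(table, q):
    """q-expectation over y of S_q(p(.|Y=y)): sum_y p(y)^q S_q(p(.|y))."""
    total = 0.0
    for y, py in enumerate(marginal_y(table)):
        cond = [row[y] / py for row in table]
        total += (py ** q) * tsallis_entropy(cond, q)
    return total

def joint_jt(tables, weights, q):
    """Joint JT q-difference: flatten X x Y and reuse jt_difference."""
    flat = [[v for row in t for v in row] for t in tables]
    return jt_difference(flat, weights, q)

def conditional_jt(tables, weights, q):
    """Conditional JT q-difference, following expression (32)."""
    nx, ny = len(tables[0]), len(tables[0][0])
    mix = [[sum(w * t[x][y] for w, t in zip(weights, tables))
            for y in range(ny)] for x in range(nx)]
    return conditional_entropy(mix, q) - sum(
        (w ** q) * conditional_entropy(t, q) for w, t in zip(weights, tables))

def chain_rule_gap(q, seed=0):
    """|joint - (conditional + marginal)| for random strictly positive tables."""
    rng = random.Random(seed)
    tables = []
    for _ in range(2):
        raw = [[rng.random() + 0.05 for _ in range(3)] for _ in range(4)]
        total = sum(sum(row) for row in raw)
        tables.append([[v / total for v in row] for row in raw])
    weights = [0.3, 0.7]
    marginals = [marginal_y(t) for t in tables]
    return abs(joint_jt(tables, weights, q)
               - conditional_jt(tables, weights, q)
               - jt_difference(marginals, weights, q))
```

The gap is zero up to rounding for any $q \ge 0$, because the Tsallis chain rule $S_q(X,Y) = S_q(Y) + S_q(X|Y)$ holds exactly when the conditional entropy is defined through $q$-expectations.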
Let us now turn our attention to the case where $Y = X^k$ for some $k \in \mathbb{N}$. In the following, the notation $(A_n)_{n \in \mathbb{N}}$ denotes a stationary ergodic process with values on some finite alphabet $\mathcal{A}$.

Definition 14 Let $X$ and $T$ be measure spaces, with $X$ finite, and let $F = [(X_n)_{n \in \mathbb{N}}]^T$ be a family of stochastic processes (taking values on the alphabet $X$) indexed by $T$. The $k$-th order joint JT $q$-difference of $F$ is defined, for $k = 1,\ldots,n$, as

$$T_{q,k}^{\mathrm{joint},\pi}(F) \triangleq T_q^\pi(p_{X^k}),$$

and the $k$-th order conditional JT $q$-difference of $F$ is defined, for $k = 1,\ldots,n$, as

$$T_{q,k}^{\mathrm{cond},\pi}(F) \triangleq T_q^\pi(p_{X|X^k}),$$

and, for $k = 0$, as $T_{q,0}^{\mathrm{cond},\pi}(F) \triangleq T_{q,1}^{\mathrm{joint},\pi}(F) = T_q^\pi(p_X)$.

Proposition 15 The joint and conditional $k$-th order JT $q$-differences are related through:

$$T_{q,k}^{\mathrm{joint},\pi}(F) = \sum_{i=0}^{k-1} T_{q,i}^{\mathrm{cond},\pi}(F). \qquad (35)$$

Proof Use Proposition 13 and induction. □
5.4 Asymptotic Analysis in the Extensive Case


We now focus on the extensive case ($q = 1$) for a brief asymptotic analysis of the $k$-th order joint and conditional JT 1-differences (or conditional Jensen-Shannon divergences) when $k$ goes to infinity. The conditional Jensen-Shannon divergence was introduced by El-Yaniv et al. (1998) to address the two-sample problem for strings emitted by Markovian sources. Given two strings $s$ and $t$, the goal is to decide whether they were emitted by the same source or by different sources. Under some fair assumptions, the most likely $k$-th order Markovian joint source of $s$ and $t$ is governed by a distribution $\hat{r}$ given by

$$\hat{r} = \arg\min_r\; \lambda D(\hat{p}_s \| r) + (1-\lambda) D(\hat{p}_t \| r), \qquad (36)$$

where $D(\cdot\|\cdot)$ are conditional KL divergences, $\hat{p}_s$ and $\hat{p}_t$ are the empirical $(k-1)$-th order conditionals associated with $s$ and $t$, respectively, and $\lambda = |s|/(|s|+|t|)$ is the length ratio. The solution of the optimization problem is

$$\hat{r}(a|c) = \frac{\lambda\,\hat{p}_s(c)}{\lambda\,\hat{p}_s(c) + (1-\lambda)\,\hat{p}_t(c)}\,\hat{p}_s(a|c) + \frac{(1-\lambda)\,\hat{p}_t(c)}{\lambda\,\hat{p}_s(c) + (1-\lambda)\,\hat{p}_t(c)}\,\hat{p}_t(a|c),$$

where $a \in \mathcal{A}$ is a symbol and $c \in \mathcal{A}^{k-1}$ is a context; this can be rewritten as $\hat{r}(a,c) = \lambda\,\hat{p}_s(a,c) + (1-\lambda)\,\hat{p}_t(a,c)$; that is, the optimum in (36) is a mixture of $\hat{p}_s$ and $\hat{p}_t$ weighted by the string lengths. Notice that, at the minimum, we have

$$\lambda D(\hat{p}_s \| \hat{r}) + (1-\lambda) D(\hat{p}_t \| \hat{r}) = JS^{\mathrm{cond},(\lambda,1-\lambda)}(\hat{p}_s, \hat{p}_t).$$
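To make the mixture solution concrete, the sketch below estimates empirical $k$-gram distributions from two short strings and evaluates the conditional $\hat{r}(a|c)$ with the formula above. It assumes that $\hat{p}_s(a,c)$ is the relative frequency of the $k$-gram $ca$ in $s$ and that $\hat{p}_s(c)$ is its context marginal; the helper names and the example strings are hypothetical.

```python
from collections import Counter

def kgram_distribution(s, k):
    """Empirical joint distribution of the k-grams (context + symbol) of s."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def context_marginal(joint, context):
    """Probability of a (k-1)-symbol context: sum over the final symbol."""
    return sum(p for gram, p in joint.items() if gram[:-1] == context)

def mixture_conditional(p_s, p_t, lam, context, symbol):
    """r(a|c) as a context-dependent mixture of the two empirical conditionals."""
    ps_c = context_marginal(p_s, context)
    pt_c = context_marginal(p_t, context)
    denom = lam * ps_c + (1.0 - lam) * pt_c
    gram = context + symbol
    # A conditional is only defined where its context has positive probability.
    cond_s = p_s.get(gram, 0.0) / ps_c if ps_c > 0 else 0.0
    cond_t = p_t.get(gram, 0.0) / pt_c if pt_c > 0 else 0.0
    return (lam * ps_c / denom) * cond_s + ((1.0 - lam) * pt_c / denom) * cond_t

s, t, k = "abracadabra", "banana", 2
lam = len(s) / (len(s) + len(t))
p_s, p_t = kgram_distribution(s, k), kgram_distribution(t, k)

# Joint form of the same optimum: r(a, c) = lam * p_s(a, c) + (1 - lam) * p_t(a, c).
joint_mix = {g: lam * p_s.get(g, 0.0) + (1.0 - lam) * p_t.get(g, 0.0)
             for g in set(p_s) | set(p_t)}
r_context_a = context_marginal(joint_mix, "a")
r_b_given_a = mixture_conditional(p_s, p_t, lam, "a", "b")
```

Dividing the joint mixture `joint_mix["ab"]` by its context marginal `r_context_a` gives the same value as `r_b_given_a`, which is the equivalence between the conditional and joint forms stated above.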


It is tempting to investigate the asymptotic behavior of the conditional and joint JS divergences when $k \to \infty$; however, unlike other asymptotic information theoretic quantities, like the entropy or cross entropy rates, this behavior fails to characterize the sources $s$ and $t$. Intuitively, this is justified by the fact that observing more and more symbols drawn from the mixture of the two sources rapidly decreases the uncertainty about which source generated the sample. Indeed, from the asymptotic equipartition property of stationary ergodic sources (Cover and Thomas, 1991), we have that $\lim_{k\to\infty} \frac{1}{k} H(p_{X^k}) = \lim_{k\to\infty} H(p_{X|X^k})$, which implies

$$\lim_{k\to\infty} JS_k^{\mathrm{cond}} = \lim_{k\to\infty} \frac{1}{k} JS_k^{\mathrm{joint}} \le \lim_{k\to\infty} \frac{1}{k} H(\pi) = 0,$$

where we used the fact that the JS divergence is upper-bounded by the entropy of the mixture $H(\pi)$ (see Proposition 10). Since the conditional JS divergence must be non-negative, we therefore conclude that $\lim_{k\to\infty} JS_k^{\mathrm{cond}} = 0$, pointwise.


6. Nonextensive Mutual Information Kernels 


In this section we consider the application of extensive and nonextensive entropies to define kernels on measures; since kernels involve pairs of measures, throughout this section $|T| = 2$. Based on the denormalization formulae presented in Section 2.3, we devise novel kernels related to the JS divergence and the JT $q$-difference; these kernels allow setting a weight for each argument, and are therefore called weighted Jensen-Tsallis kernels. We also introduce kernels related to the JR divergence (Section 3.3) and the JT divergence (Section 3.4), and establish a connection between the Tsallis kernels and a family of kernels investigated by Hein et al. (2004) and Fuglede (2005), placing those kernels under a new information-theoretic light. After that, we give a brief overview of string kernels and, using the results of Section 5.3, we devise $k$-th order Jensen-Tsallis kernels between stochastic processes that subsume the well-known $p$-spectrum kernel of Leslie et al. (2002).


6.1 Positive and Negative Definite Kernels 

We start by recalling basic concepts from kernel theory (Schölkopf and Smola, 2002); in the following, $X$ denotes a nonempty set.

Definition 16 Let $\varphi : X \times X \to \mathbb{R}$ be a symmetric function, that is, a function satisfying $\varphi(y,x) = \varphi(x,y)$, for all $x, y \in X$. $\varphi$ is called a positive definite (pd) kernel if and only if

$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j\, \varphi(x_i,x_j) \ge 0$$

for all $n \in \mathbb{N}$, $x_1,\ldots,x_n \in X$ and $c_1,\ldots,c_n \in \mathbb{R}$.

Definition 17 Let $\psi : X \times X \to \mathbb{R}$ be symmetric. $\psi$ is called a negative definite (nd) kernel if and only if

$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j\, \psi(x_i,x_j) \le 0$$

for all $n \in \mathbb{N}$, $x_1,\ldots,x_n \in X$ and $c_1,\ldots,c_n \in \mathbb{R}$, satisfying the additional constraint $c_1 + \ldots + c_n = 0$. In this case, $-\psi$ is called conditionally pd; obviously, positive definiteness implies conditional positive definiteness.


The sets of pd and nd kernels are both closed under pointwise sums/integrations, the former 
being also closed under pointwise products; moreover, both sets are closed under pointwise con- 
vergence. While pd kernels “correspond” to inner products via embedding in a Hilbert space, nd 
kernels that vanish on the diagonal and are positive anywhere else, “correspond” to squared Hilber- 
tian distances. These facts, and the following propositions and lemmas, are shown in Berg et al. 
(1984). 


Proposition 18 Let $\psi : X \times X \to \mathbb{R}$ be a symmetric function, and $x_0 \in X$. Let $\varphi : X \times X \to \mathbb{R}$ be given by

$$\varphi(x,y) = \psi(x,x_0) + \psi(y,x_0) - \psi(x,y) - \psi(x_0,x_0).$$

Then, $\varphi$ is pd if and only if $\psi$ is nd.

Proposition 19 The function $\psi : X \times X \to \mathbb{R}$ is a nd kernel if and only if $\exp(-t\psi)$ is pd for all $t > 0$.

Proposition 20 The function $\psi : X \times X \to \mathbb{R}_+$ is a nd kernel if and only if $(t + \psi)^{-1}$ is pd for all $t > 0$.

Lemma 21 If $\psi$ is nd and nonnegative on the diagonal, that is, $\psi(x,x) \ge 0$ for all $x \in X$, then $\psi^\alpha$, for $\alpha \in [0,1]$, and $\ln(1+\psi)$, are also nd.

Lemma 22 If $f : X \to \mathbb{R}$ satisfies $f \ge 0$, then, for $\alpha \in [1,2]$, the function $\psi_\alpha(x,y) = -(f(x) + f(y))^\alpha$ is a nd kernel.

The following definition (Berg et al., 1984) has been used in a machine learning context by Cuturi and Vert (2005).

Definition 23 Let $(X, \oplus)$ be a semigroup.² A function $\varphi : X \to \mathbb{R}$ is called pd (in the semigroup sense) if $k : X \times X \to \mathbb{R}$, defined as $k(x,y) = \varphi(x \oplus y)$, is a pd kernel. Likewise, $\varphi$ is called nd if $k$ is a nd kernel. Accordingly, these are called semigroup kernels.


6.2 Jensen-Shannon and Tsallis Kernels 


The basic result that allows deriving pd kernels based on the JS divergence and, more generally, on the JT $q$-difference, is the fact that the denormalized Tsallis $q$-entropies (10) are nd functions on $(M_+^{S_q}(X), +)$, for $q \in [0,2]$. Of course, this includes the denormalized Shannon-Boltzmann-Gibbs entropy (8) as a particular case, corresponding to $q = 1$. Although part of the proof was given by Berg et al. (1984) (and by Topsøe 2000 and Cuturi and Vert 2005 for the Shannon entropy case), we present a complete proof here.

Proposition 24 For $q \in [0,2]$, the denormalized Tsallis $q$-entropy $S_q$ is a nd function on $(M_+^{S_q}(X), +)$.


Proof Since nd kernels are closed under pointwise integration, it suffices to prove that $\varphi_q$ (see Equation 11) is nd on $(\mathbb{R}_+, +)$. For $q \ne 1$, $\varphi_q(y) = (q-1)^{-1}(y - y^q)$. Let us consider two cases separately: if $q \in [0,1)$, $\varphi_q(y)$ equals a positive constant times $-\iota + \iota^q$, where $\iota(y) = y$ is the identity map defined on $\mathbb{R}_+$. Since the set of nd functions is closed under sums, we only need to show that both $-\iota$ and $\iota^q$ are nd. Both $\iota$ and $-\iota$ are nd, as can easily be seen from the definition; besides, since $\iota$ is nd and nonnegative, Lemma 21 guarantees that $\iota^q$ is also nd. For the second case, where $q \in (1,2]$, $\varphi_q(y)$ equals a positive constant times $\iota - \iota^q$. It only remains to show that $-\iota^q$ is nd for $q \in (1,2]$: Lemma 22 guarantees that the kernel $k(x,y) = -(x+y)^q$ is nd; therefore $-\iota^q$ is a nd function.

For $q = 1$, we use the fact that

$$\varphi_1(x) = \varphi_H(x) = -x \ln x = \lim_{q\to 1} \frac{x - x^q}{q-1} = \lim_{q\to 1} \varphi_q(x),$$

where the limit is obtained by L'Hôpital's rule; since the set of nd functions is closed under limits, $\varphi_1(x)$ is nd. □


The following lemma, proved in Berg et al. (1984), will also be needed below.

Lemma 25 The function $\zeta_q : \mathbb{R}_{++} \to \mathbb{R}$, defined as $\zeta_q(y) = y^{-q}$, is pd, for $q \in [0,1]$.


We are now in a position to present the main contribution of this section, which is a family of 
weighted Jensen-Tsallis kernels, generalizing the JS-based (and other) kernels in two ways: (i) they 
allow using unnormalized measures; equivalently, they allow using different weights for each of the 
two arguments; (ii) they extend the mutual information feature of the JS kernel to the nonextensive 
scenario. 





2. Recall that $(X, \oplus)$ is a semigroup if $\oplus$ is a binary operation in $X$ that is associative and has an identity element.
Definition 26 (weighted Jensen-Tsallis kernels) The kernel $\tilde{k}_q : M_+^{S_q}(X) \times M_+^{S_q}(X) \to \mathbb{R}$ is defined as

$$\tilde{k}_q(\mu_1,\mu_2) \triangleq \tilde{k}_q(\omega_1 p_1, \omega_2 p_2) = \left(S_q(\pi) - T_q^\pi(p_1,p_2)\right)(\omega_1+\omega_2)^q,$$

where $p_1 = \mu_1/\omega_1$ and $p_2 = \mu_2/\omega_2$ are the normalized counterparts of $\mu_1$ and $\mu_2$, with corresponding masses $\omega_1, \omega_2 \in \mathbb{R}_+$, and $\pi = (\omega_1/(\omega_1+\omega_2),\, \omega_2/(\omega_1+\omega_2))$.

The kernel $k_q : (M_+^{S_q}(X) \setminus \{0\})^2 \to \mathbb{R}$ is defined as

$$k_q(\mu_1,\mu_2) \triangleq k_q(\omega_1 p_1, \omega_2 p_2) = S_q(\pi) - T_q^\pi(p_1,p_2).$$

Recalling (25), notice that $S_q(\pi) - T_q^\pi(p_1,p_2) = S_q(T) - I_q(X;T) = S_q(T|X)$ can be interpreted as the Tsallis posterior conditional entropy. Hence, $k_q$ can be seen (in Bayesian classification terms) as a nonextensive expected measure of uncertainty in correctly identifying the class, given the prior $\pi = (\pi_1,\pi_2)$, and a sample from the mixture $\pi_1 p_1 + \pi_2 p_2$. The more similar the two distributions are, the greater this uncertainty.


Proposition 27 The kernel $\tilde{k}_q$ is pd, for $q \in [0,2]$. The kernel $k_q$ is pd, for $q \in [0,1]$.

Proof With $\mu_1 = \omega_1 p_1$ and $\mu_2 = \omega_2 p_2$ and using the denormalization formula of Remark 1, we obtain $\tilde{k}_q(\mu_1,\mu_2) = -S_q(\mu_1+\mu_2) + S_q(\mu_1) + S_q(\mu_2)$. Now invoke Proposition 18 with $\psi = S_q$ (which is nd by Proposition 24), $x = \mu_1$, $y = \mu_2$, and $x_0 = 0$ (the null measure). Observe now that $k_q(\mu_1,\mu_2) = \tilde{k}_q(\mu_1,\mu_2)(\omega_1+\omega_2)^{-q}$. Since the product of two pd kernels is a pd kernel and (Lemma 25) $(\omega_1+\omega_2)^{-q}$ is a pd kernel, for $q \in [0,1]$, we conclude that $k_q$ is pd. □
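The two expressions for the weighted kernel used in this proof can be compared numerically on unnormalized count vectors. The sketch below assumes finite $X$ with the counting measure and takes the denormalized entropy as the sum of $\varphi_q(y) = (y - y^q)/(q-1)$ over nonzero entries, with $-y \ln y$ at $q = 1$; the names `denormalized_entropy`, `weighted_jt_kernel`, `weighted_jt_kernel_via_pi` and `min_quadratic_form`, as well as the example count vectors, are illustrative assumptions.

```python
import math
import random

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (zero entries are skipped)."""
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in support)
    return (1.0 - sum(x ** q for x in support)) / (q - 1.0)

def denormalized_entropy(mu, q):
    """Denormalized Tsallis entropy of an unnormalized measure mu."""
    support = [y for y in mu if y > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(y * math.log(y) for y in support)
    return sum((y - y ** q) / (q - 1.0) for y in support)

def weighted_jt_kernel(mu1, mu2, q):
    """Semigroup form: S_q(mu1) + S_q(mu2) - S_q(mu1 + mu2)."""
    total = [a + b for a, b in zip(mu1, mu2)]
    return (denormalized_entropy(mu1, q) + denormalized_entropy(mu2, q)
            - denormalized_entropy(total, q))

def weighted_jt_kernel_via_pi(mu1, mu2, q):
    """Definition form: (S_q(pi) - T_q^pi(p1, p2)) * (w1 + w2)^q."""
    w1, w2 = sum(mu1), sum(mu2)
    p1, p2 = [a / w1 for a in mu1], [b / w2 for b in mu2]
    pi = [w1 / (w1 + w2), w2 / (w1 + w2)]
    mix = [pi[0] * a + pi[1] * b for a, b in zip(p1, p2)]
    jt = (tsallis_entropy(mix, q) - pi[0] ** q * tsallis_entropy(p1, q)
          - pi[1] ** q * tsallis_entropy(p2, q))
    return (tsallis_entropy(pi, q) - jt) * (w1 + w2) ** q

def min_quadratic_form(measures, q, trials=300, seed=0):
    """Smallest value of sum_ij c_i c_j K_ij over random coefficient vectors."""
    rng = random.Random(seed)
    gram = [[weighted_jt_kernel(a, b, q) for b in measures] for a in measures]
    worst = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1, 1) for _ in measures]
        value = sum(c[i] * c[j] * gram[i][j]
                    for i in range(len(c)) for j in range(len(c)))
        worst = min(worst, value)
    return worst

# Hypothetical word-count vectors over a five-word vocabulary.
counts = [[3, 0, 1, 2, 0], [1, 1, 0, 4, 2], [0, 5, 2, 0, 1], [2, 2, 2, 2, 2]]
```

Random quadratic forms cannot prove positive definiteness, but a clearly negative value would contradict Proposition 27; for these vectors the minimum stays nonnegative up to rounding for the values of $q$ tried.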


As we can see, the weighted Jensen-Tsallis kernels have two inherent properties: they are parameterized by the entropic index $q$ and they allow their arguments to be unbalanced, that is, to have different weights $\omega_i$. We now mention some instances of kernels where each of these degrees of freedom is suppressed. We start with the following subfamily of kernels, obtained by setting $q = 1$.


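Definition 28 (weighted Jensen-Shannon kernels) The kernel $\tilde{k}_{\mathrm{WJS}} : (M_+^H(X))^2 \to \mathbb{R}$ is defined as $\tilde{k}_{\mathrm{WJS}} \triangleq \tilde{k}_1$, that is,

$$\tilde{k}_{\mathrm{WJS}}(\mu_1,\mu_2) = \tilde{k}_{\mathrm{WJS}}(\omega_1 p_1, \omega_2 p_2) = \left(H(\pi) - J^\pi(p_1,p_2)\right)(\omega_1+\omega_2),$$

where $p_1 = \mu_1/\omega_1$ and $p_2 = \mu_2/\omega_2$ are the normalized counterparts of $\mu_1$ and $\mu_2$, and $\pi = (\omega_1/(\omega_1+\omega_2),\, \omega_2/(\omega_1+\omega_2))$.
Analogously, the kernel $k_{\mathrm{WJS}} : (M_+^H(X) \setminus \{0\})^2 \to \mathbb{R}$ is simply $k_{\mathrm{WJS}} \triangleq k_1$, that is,

$$k_{\mathrm{WJS}}(\mu_1,\mu_2) = k_{\mathrm{WJS}}(\omega_1 p_1, \omega_2 p_2) = H(\pi) - J^\pi(p_1,p_2).$$

Corollary 29 The weighted Jensen-Shannon kernels $\tilde{k}_{\mathrm{WJS}}$ and $k_{\mathrm{WJS}}$ are pd.

Proof Invoke Proposition 27 with $q = 1$. □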


The following family of weighted exponentiated JS kernels generalizes the so-called exponentiated JS kernel, which has been used, and shown to be pd, by Cuturi and Vert (2005).

Definition 30 (Exponentiated JS kernel) The kernel $k_{\mathrm{EJS}} : M_+^1(X) \times M_+^1(X) \to \mathbb{R}$ is defined, for $t > 0$, as

$$k_{\mathrm{EJS}}(p_1,p_2) = \exp\left[-t\,JS(p_1,p_2)\right].$$

Definition 31 (Weighted exponentiated JS kernels) The kernel $k_{\mathrm{WEJS}} : M_+^H(X) \times M_+^H(X) \to \mathbb{R}$ is defined, for $t > 0$, as

$$k_{\mathrm{WEJS}}(\mu_1,\mu_2) = \exp\left[t\,k_{\mathrm{WJS}}(\mu_1,\mu_2)\right] = \exp(t H(\pi)) \exp\left[-t J^\pi(p_1,p_2)\right]. \qquad (37)$$

Corollary 32 The kernels $k_{\mathrm{WEJS}}$ are pd. In particular, $k_{\mathrm{EJS}}$ is pd.

Proof Results from Proposition 19 and Corollary 29. Notice that although $k_{\mathrm{WEJS}}$ is pd, neither of its two exponential factors in (37) is pd. □


We now keep $q \in [0,2]$ but consider the weighted JT kernel family restricted to normalized measures, $k_q|_{(M_+^1(X))^2}$. This corresponds to setting uniform weights ($\omega_1 = \omega_2 = 1/2$); note that in this case $\tilde{k}_q$ and $k_q$ collapse into the same kernel,

$$\tilde{k}_q(p_1,p_2) = k_q(p_1,p_2) = \ln_q(2) - T_q(p_1,p_2).$$

Proposition 27 guarantees that these kernels are pd for $q \in [0,2]$. Remarkably, we recover three well-known particular cases for $q \in \{0,1,2\}$. We start with the Jensen-Shannon kernel, introduced and shown to be pd by Hein et al. (2004); it is a particular case of a weighted Jensen-Shannon kernel in Definition 28.

Definition 33 (Jensen-Shannon kernel) The kernel $k_{\mathrm{JS}} : M_+^1(X) \times M_+^1(X) \to \mathbb{R}$ is defined as

$$k_{\mathrm{JS}}(p_1,p_2) = \ln 2 - JS(p_1,p_2).$$

Corollary 34 The kernel $k_{\mathrm{JS}}$ is pd.

Proof $k_{\mathrm{JS}}$ is the restriction of $k_{\mathrm{WJS}}$ to $M_+^1(X) \times M_+^1(X)$. □

Finally, we study two other particular cases of the family of Tsallis kernels: the Boolean and 
linear kernels. 


Definition 35 (Boolean kernel) Let the kernel $k_{\mathrm{Bool}} : M_+^{1,S_0}(X) \times M_+^{1,S_0}(X) \to \mathbb{R}$ be defined as $k_{\mathrm{Bool}} = k_0$, that is,

$$k_{\mathrm{Bool}}(p_1,p_2) = \nu\left(\mathrm{supp}(p_1) \cap \mathrm{supp}(p_2)\right),$$

that is, $k_{\mathrm{Bool}}(p_1,p_2)$ equals the measure of the intersection of the supports (cf. Equation 26). In particular, if $X$ is finite and $\nu$ is the counting measure, the above may be written as

$$k_{\mathrm{Bool}}(p_1,p_2) = \|p_1 \odot p_2\|_0.$$




Definition 36 (Linear kernel) Let the kernel $k_{\mathrm{lin}} : M_+^{1,S_2}(X) \times M_+^{1,S_2}(X) \to \mathbb{R}$ be defined as

$$k_{\mathrm{lin}}(p_1,p_2) = \frac{1}{2}\langle p_1,p_2\rangle.$$

Corollary 37 The kernels $k_{\mathrm{Bool}}$ and $k_{\mathrm{lin}}$ are pd.

Proof Invoke Proposition 27 with $q = 0$ and $q = 2$. Notice that, for $q = 2$, we just recover the well-known property of the inner product kernel (Schölkopf and Smola, 2002), which is equal to $k_{\mathrm{lin}}$ up to a scalar. □


In conclusion, the Boolean kernel, the Jensen-Shannon kernel, and the linear kernel are simply 
particular elements of the much wider family of Jensen-Tsallis kernels, continuously parameterized 
by q € [0,2]. Furthermore, the Jensen-Tsallis kernels are a particular subfamily of the even wider 
set of weighted Jensen-Tsallis kernels. 
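The way a single entropic index sweeps through the three named kernels can be seen in a few lines of code. This is a minimal sketch for finite $X$ and normalized measures with uniform weights, using $k_q(p_1,p_2) = \ln_q(2) - T_q(p_1,p_2)$ with $\ln_q(x) = (x^{1-q}-1)/(1-q)$; the function names are illustrative assumptions.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis q-entropy of a finite distribution (zero entries are skipped)."""
    support = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in support)
    return (1.0 - sum(x ** q for x in support)) / (q - 1.0)

def q_logarithm(x, q):
    """Deformed logarithm ln_q(x); reduces to ln(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def jt_kernel(p1, p2, q):
    """Jensen-Tsallis kernel on normalized measures: ln_q(2) - T_q(p1, p2)."""
    mix = [(a + b) / 2 for a, b in zip(p1, p2)]
    jt = tsallis_entropy(mix, q) - (
        tsallis_entropy(p1, q) + tsallis_entropy(p2, q)) / 2 ** q
    return q_logarithm(2.0, q) - jt

p1 = [0.5, 0.5, 0.0, 0.0]
p2 = [0.0, 0.25, 0.25, 0.5]

boolean_value = jt_kernel(p1, p2, 0.0)   # size of the common support
js_value = jt_kernel(p1, p2, 1.0)        # ln 2 - JS(p1, p2)
linear_value = jt_kernel(p1, p2, 2.0)    # <p1, p2> / 2
```

With these inputs the common support has one element and the inner product is 0.125, so the Boolean and linear values are 1 and 0.0625, respectively; intermediate values of $q$ interpolate between these behaviors.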

One of the key features of our generalization is that the kernels are defined on unnormalized 
measures, with arbitrary mass. This is relevant, for example, in applications of kernels on empirical 
measures (e.g., word counts, pixel intensity histograms); instead of the usual step of normalization
(Hein et al., 2004), we may leave these empirical measures unnormalized, thus allowing objects of
different size (e.g., total number of words in a document, total number of image pixels) to be weighted
differently. Another possibility opened by our generalization is the explicit inclusion of weights: 
given two normalized measures, they can be multiplied by arbitrary (positive) weights before being 
fed to the kernel function. 


6.3 Other Kernels Based on Jensen Differences and q-Differences 


It is worthwhile to note that the Jensen-Rényi and the Jensen-Tsallis divergences also yield positive 
definite kernels, albeit there are not any obvious “weighted generalizations” like the ones presented 
above for the Tsallis kernels. 


Proposition 38 (Jensen-Rényi and Jensen-Tsallis kernels) For any $q \in [0,2]$, the kernel

$$(p_1,p_2) \mapsto S_q\!\left(\frac{p_1+p_2}{2}\right)$$

and the (unweighted) Jensen-Tsallis divergence $J_{S_q}$ (20) are nd kernels on $M_+^1(X) \times M_+^1(X)$.
Also, for any $q \in [0,1]$, the kernel

$$(p_1,p_2) \mapsto R_q\!\left(\frac{p_1+p_2}{2}\right)$$

and the (unweighted) Jensen-Rényi divergence $J_{R_q}$ (19) are nd kernels on $M_+^1(X) \times M_+^1(X)$.


Proof The fact that $(p_1,p_2) \mapsto S_q\!\left(\frac{p_1+p_2}{2}\right)$ is nd results from the embedding $x \mapsto x/2$ and Proposition 24. Since $(p_1,p_2) \mapsto -\frac{S_q(p_1)+S_q(p_2)}{2}$ is trivially nd, we have that $J_{S_q}$ is a sum of nd functions, which makes it nd. To prove the negative definiteness of the kernel $(p_1,p_2) \mapsto R_q\!\left(\frac{p_1+p_2}{2}\right)$, notice first that the kernel $(x,y) \mapsto (x+y)/2$ is clearly nd. From Lemma 21 and integrating, we have that $(p_1,p_2) \mapsto \int \left(\frac{p_1+p_2}{2}\right)^q$ is nd for $q \in [0,1]$. From the same lemma we have that $(p_1,p_2) \mapsto \ln\!\left(t + \int \left(\frac{p_1+p_2}{2}\right)^q\right)$ is nd for any $t > 0$. Since $\int \left(\frac{p_1+p_2}{2}\right)^q \ge 0$, the negative definiteness of $(p_1,p_2) \mapsto R_q\!\left(\frac{p_1+p_2}{2}\right)$ follows by taking the limit $t \to 0$. By the same argument as above, we conclude that $J_{R_q}$ is nd. □





As a consequence, we have from Proposition 19 that the following kernels are pd for any $q \in [0,1]$ and $t > 0$:

$$k_{\mathrm{EJR}}(p_1,p_2) = \exp\!\left(-t\,R_q\!\left(\frac{p_1+p_2}{2}\right)\right) = \left(\int \left(\frac{p_1+p_2}{2}\right)^q\right)^{-\frac{t}{1-q}},$$

and its "normalized" counterpart,

$$k_{\mathrm{EJR}}(p_1,p_2) = \exp\!\left(-t\,J_{R_q}(p_1,p_2)\right) = \left(\frac{\int \left(\frac{p_1+p_2}{2}\right)^q}{\sqrt{\int p_1^q \int p_2^q}}\right)^{-\frac{t}{1-q}}.$$

Although we could have derived its positive definiteness without ever referring to the Rényi entropy, the latter has in fact a suggestive interpretation: it corresponds to an exponentiation of the Jensen-Rényi divergence; it generalizes the case $q = 1$, which corresponds to the exponentiated Jensen-Shannon kernel.

Finally, we point out a relationship between the Jensen-Tsallis divergences (Section 3.4) and a family of difference kernels introduced by Fuglede (2005),

$$\psi_{\alpha,\beta}(x,y) = \left(\frac{x^\alpha + y^\alpha}{2}\right)^{1/\alpha} - \left(\frac{x^\beta + y^\beta}{2}\right)^{1/\beta}.$$

Fuglede (2005) derived the negative definiteness of the above family of kernels provided $1 \le \alpha \le \infty$ and $1/2 \le \beta \le \alpha$; he went further by providing representations for these kernels. Hein et al. (2004) used the fact that the integral $\int \psi_{\alpha,\beta}(x(t),y(t))\,d\tau(t)$ is also nd to derive a family of pd kernels for probability measures that included the Jensen-Shannon kernel (see Def. 33).





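We start by noting the following property of the extended Tsallis entropy, which is very easy to establish:

$$S_q(y) = q^{-1} S_{1/q}(y^q).$$

As a consequence, by making the substitutions $r \triangleq q^{-1}$, $x_1 \triangleq y_1^q$ and $x_2 \triangleq y_2^q$, we have that

$$\begin{aligned}
J_{S_q}(y_1,y_2) &= S_q\!\left(\frac{y_1+y_2}{2}\right) - \frac{S_q(y_1)+S_q(y_2)}{2} \\
&= r\left[S_r\!\left(\left(\frac{x_1^r+x_2^r}{2}\right)^{1/r}\right) - \frac{S_r(x_1)+S_r(x_2)}{2}\right] \triangleq r\,\tilde{J}_{S_r}(x_1,x_2),
\end{aligned}$$

where we introduced

$$\begin{aligned}
\tilde{J}_{S_r}(x_1,x_2) &= S_r\!\left(\left(\frac{x_1^r+x_2^r}{2}\right)^{1/r}\right) - \frac{S_r(x_1)+S_r(x_2)}{2} \\
&= (r-1)^{-1}\left[\left(\frac{x_1^r+x_2^r}{2}\right)^{1/r} - \frac{x_1+x_2}{2}\right]. \qquad (38)
\end{aligned}$$

Since $J_{S_q}$ is nd for $q \in [0,2]$, we have that $\tilde{J}_{S_r}$ is nd for $r \in [1/2, \infty]$.

Notice that while $J_{S_q}$ may be interpreted as "the difference between the Tsallis $q$-entropy of the mean and the mean of the Tsallis $q$-entropies," $\tilde{J}_{S_q}$ may be interpreted as "the difference between the Tsallis $q$-entropy of the $q$-power mean and the mean of the Tsallis $q$-entropies."

From (38) we have that

$$\psi_{\alpha,\beta}(x,y) = (\alpha-1)\,\tilde{J}_{S_\alpha}(x,y) - (\beta-1)\,\tilde{J}_{S_\beta}(x,y),$$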


so the family of probabilistic kernels studied in Hein et al. (2004) can be written in terms of Jensen- 
Tsallis divergences. 
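The identity relating Fuglede's kernels to the power-mean Jensen-Tsallis differences is easy to confirm for scalar arguments. The sketch below assumes positive reals $x$, $y$ and orders different from 1, and uses the scalar denormalized entropy $\varphi_r(y) = (y - y^r)/(r-1)$; the function names are illustrative assumptions.

```python
def power_mean(x, y, r):
    """Power mean of order r of two positive numbers."""
    return ((x ** r + y ** r) / 2.0) ** (1.0 / r)

def fuglede_kernel(x, y, alpha, beta):
    """psi_{alpha,beta}(x, y): difference of two power means."""
    return power_mean(x, y, alpha) - power_mean(x, y, beta)

def scalar_entropy(y, r):
    """Scalar denormalized Tsallis entropy, valid for r != 1."""
    return (y - y ** r) / (r - 1.0)

def power_mean_jt(x1, x2, r):
    """Entropy of the r-power mean minus the mean of the entropies."""
    return scalar_entropy(power_mean(x1, x2, r), r) - (
        scalar_entropy(x1, r) + scalar_entropy(x2, r)) / 2.0

def closed_form(x1, x2, r):
    """Right-hand side of (38): (power mean - arithmetic mean) / (r - 1)."""
    return (power_mean(x1, x2, r) - (x1 + x2) / 2.0) / (r - 1.0)

x, y, alpha, beta = 0.2, 0.7, 3.0, 2.0
lhs = fuglede_kernel(x, y, alpha, beta)
rhs = (alpha - 1.0) * power_mean_jt(x, y, alpha) \
    - (beta - 1.0) * power_mean_jt(x, y, beta)
```

The arithmetic-mean terms cancel between the two scaled differences, which is why only the two power means survive in `lhs`.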


6.4 k-th Order Jensen-Tsallis String Kernels 


This subsection introduces a new class of string kernels inspired by the $k$-th order JT $q$-difference introduced in Section 5.3. Although we refer to them as "string kernels," they are more generally kernels between stochastic processes.

Several string kernels (i.e., kernels operating on the space of strings) have been proposed in the literature (Haussler, 1999; Lodhi et al., 2002; Leslie et al., 2002; Vishwanathan and Smola, 2003; Shawe-Taylor and Cristianini, 2004; Cortes et al., 2004). These are kernels defined on $\mathcal{A}^* \times \mathcal{A}^*$, where $\mathcal{A}^*$ is the Kleene closure of a finite alphabet $\mathcal{A}$ (i.e., the set of all finite strings formed by characters in $\mathcal{A}$ together with the empty string $\epsilon$). The $p$-spectrum kernel (Leslie et al., 2002) is associated with a feature space indexed by $\mathcal{A}^p$ (the set of length-$p$ strings). The feature representation of a string $s$, $\Phi^p(s) = (\phi_u^p(s))_{u \in \mathcal{A}^p}$, counts the number of times each $u \in \mathcal{A}^p$ occurs as a substring of $s$,

$$\phi_u^p(s) = \left|\{(v_1,v_2) : s = v_1 u v_2\}\right|.$$

The $p$-spectrum kernel is then defined as the standard inner product in $\mathbb{R}^{|\mathcal{A}|^p}$,

$$k_{\mathrm{SK}}^p(s,t) = \langle \Phi^p(s), \Phi^p(t)\rangle. \qquad (39)$$
A more general kernel is the weighted all-substrings kernel (Vishwanathan and Smola, 2003), which takes into account the contribution of all the substrings weighted by their length. This kernel can be viewed as a conic combination of $p$-spectrum kernels and can be written as

$$k_{\mathrm{WASK}}(s,t) = \sum_{p=1}^{\infty} \alpha_p\, k_{\mathrm{SK}}^p(s,t), \qquad (40)$$

where $\alpha_p$ is often chosen to decay exponentially with $p$ and truncated; for example, $\alpha_p = \lambda^p$, if $p_{\min} \le p \le p_{\max}$, and $\alpha_p = 0$, otherwise, where $0 < \lambda < 1$ is the decaying factor.
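A direct implementation of (39) and (40) helps fix the definitions. The sketch below counts $p$-grams with a dictionary, which is a naive quadratic-time approach and not the linear-time suffix-tree method of Vishwanathan and Smola (2003); the function names and the truncation parameters `p_min` and `p_max` are illustrative assumptions.

```python
from collections import Counter

def spectrum_features(s, p):
    """Counts of every length-p substring of s (the p-spectrum of s)."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of the two p-gram count vectors."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    # A Counter returns 0 for missing keys, so unseen p-grams contribute nothing.
    return sum(count * ft[u] for u, count in fs.items())

def weighted_all_substrings_kernel(s, t, decay, p_min, p_max):
    """Truncated conic combination of p-spectrum kernels with weights decay**p."""
    return sum(decay ** p * spectrum_kernel(s, t, p)
               for p in range(p_min, p_max + 1))

k2 = spectrum_kernel("abab", "bab", 2)
k_wask = weighted_all_substrings_kernel("abab", "bab", 0.5, 1, 2)
```

For the strings "abab" and "bab", the shared bigrams are "ab" (counts 2 and 1) and "ba" (counts 1 and 1), so `k2` is 3; the unigram kernel is 6, so `k_wask` is 0.5 · 6 + 0.25 · 3 = 3.75.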
Both $k_{\mathrm{SK}}^p$ and $k_{\mathrm{WASK}}$ are trivially positive definite, the former by construction and the latter because it is a conic combination of positive definite kernels. A remarkable fact is that both kernels may be computed in $O(|s| + |t|)$ time (i.e., with cost that is linear in the length of the strings), as shown by Vishwanathan and Smola (2003), by using data structures such as suffix trees or suffix arrays (Gusfield, 1997). Moreover, with $s$ fixed, any kernel $k(s,t)$ may be computed in time $O(|t|)$, which is particularly useful for classification applications.

We will now see how Jensen-Tsallis kernels may be used as string kernels. In Section 5.3, we introduced the concept of joint and conditional JT $q$-differences. We have seen that joint JT $q$-differences are just JT $q$-differences in a product space of the form $X = X_1 \times X_2$; for $k$-th order joint JT $q$-differences this product space is of the form $\mathcal{A}^k = \mathcal{A} \times \mathcal{A}^{k-1}$. Therefore, they still yield positive definite kernels as those introduced in Definition 26, where $X = \mathcal{A}^k$. The next definition and proposition summarize these statements.


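Definition 39 (k-th order weighted JT kernels) Let $\mathcal{S}(\mathcal{A})$ be the set of stationary and ergodic stochastic processes that take values on the alphabet $\mathcal{A}$. For $k \in \mathbb{N}$ and $q \in [0,2]$, let the kernel $\tilde{k}_{q,k} : (\mathbb{R}_+ \times \mathcal{S}(\mathcal{A}))^2 \to \mathbb{R}$ be defined as

$$\begin{aligned}
\tilde{k}_{q,k}((\omega_1,s_1),(\omega_2,s_2)) &\triangleq \tilde{k}_q(\omega_1 p_{s_1,k},\, \omega_2 p_{s_2,k}) \qquad (41) \\
&= \left(S_q(\pi) - T_{q,k}^{\mathrm{joint},\pi}(s_1,s_2)\right)(\omega_1+\omega_2)^q,
\end{aligned}$$

where $p_{s_1,k}$ and $p_{s_2,k}$ are the $k$-th order joint probability functions associated with the stochastic sources $s_1$ and $s_2$, and $\pi = (\omega_1/(\omega_1+\omega_2),\, \omega_2/(\omega_1+\omega_2))$.
Let the kernel $k_{q,k} : (\mathbb{R}_{++} \times \mathcal{S}(\mathcal{A}))^2 \to \mathbb{R}$ be defined as

$$\begin{aligned}
k_{q,k}((\omega_1,s_1),(\omega_2,s_2)) &\triangleq k_q(\omega_1 p_{s_1,k},\, \omega_2 p_{s_2,k}) \qquad (42) \\
&= S_q(\pi) - T_{q,k}^{\mathrm{joint},\pi}(s_1,s_2).
\end{aligned}$$

Proposition 40 The kernel $\tilde{k}_{q,k}$ is pd, for $q \in [0,2]$. The kernel $k_{q,k}$ is pd, for $q \in [0,1]$.

Proof Define the map $g : \mathbb{R}_+ \times \mathcal{S}(\mathcal{A}) \to \mathbb{R}_+ \times M_+^{1,S_q}(\mathcal{A}^k)$ as $(\omega,s) \mapsto g(\omega,s) = (\omega, p_{s,k})$. From Proposition 27, the kernel $\tilde{k}_q(g(\omega_1,s_1), g(\omega_2,s_2))$ is pd and therefore so is $\tilde{k}_{q,k}((\omega_1,s_1),(\omega_2,s_2))$; proceed analogously for $k_{q,k}$. □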


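At this point, one might wonder whether the "$k$-th order conditional JT kernel" $\tilde{k}_{q,k}^{\mathrm{cond}}$ that would be obtained by replacing $T_{q,k}^{\mathrm{joint},\pi}$ with $T_{q,k}^{\mathrm{cond},\pi}$ in (41-42) is also pd. Formula (35) shows that such a "conditional JT kernel" is a difference between two joint JT kernels, which is inconclusive. The following proposition shows that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general. The proof, which is in Appendix D, proceeds by building a counterexample.

Proposition 41 Let $\tilde{k}_{q,k}^{\mathrm{cond}}$ be defined as $\tilde{k}_{q,k}^{\mathrm{cond}}(s_1,s_2) \triangleq \left(S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1,s_2)\right)(\omega_1+\omega_2)^q$; and $k_{q,k}^{\mathrm{cond}}$ be defined as $k_{q,k}^{\mathrm{cond}}(s_1,s_2) \triangleq \left(S_q(\pi) - T_{q,k}^{\mathrm{cond},\pi}(s_1,s_2)\right)$. It holds that $\tilde{k}_{q,k}^{\mathrm{cond}}$ and $k_{q,k}^{\mathrm{cond}}$ are not pd in general.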
Despite the negative result in Proposition 41, the chain rule in Proposition 15 still allows us to define pd kernels by combining conditional JT $q$-differences.

Proposition 42 Let $(\beta_k)_{k \in \mathbb{N}}$ be a non-increasing infinitesimal sequence, that is, satisfying

$$\beta_0 \ge \beta_1 \ge \ldots \ge \beta_n \to 0.$$

Any kernel of the form

$$\sum_{k=0}^{\infty} \beta_k\, \tilde{k}_{q,k}^{\mathrm{cond}} \qquad (43)$$

is pd for $q \in [0,2]$; and any kernel of the form

$$\sum_{k=0}^{\infty} \beta_k\, k_{q,k}^{\mathrm{cond}}$$

is pd for $q \in [0,1]$, provided both series above converge pointwise.

Proof From the chain rule, we have that (defining the 0-th order joint JT kernel as $\tilde{k}_{q,0} \triangleq 0$)

$$\sum_{k=0}^{\infty} \beta_k\, \tilde{k}_{q,k}^{\mathrm{cond}} = \sum_{k=0}^{\infty} \beta_k \left(\tilde{k}_{q,k+1} - \tilde{k}_{q,k}\right) = \lim_{n\to\infty} \left[\sum_{k=1}^{n} \alpha_k\, \tilde{k}_{q,k} + \beta_n\, \tilde{k}_{q,n+1}\right] = \sum_{k=1}^{\infty} \alpha_k\, \tilde{k}_{q,k}, \qquad (44)$$

with $\alpha_k = \beta_{k-1} - \beta_k$ (the term $\lim \beta_n \tilde{k}_{q,n+1}$ was dropped because $\beta_n \to 0$ and $\tilde{k}_{q,n+1}$ is bounded). Since $(\beta_k)_{k \in \mathbb{N}}$ is non-increasing, we have that $(\alpha_k)_{k \in \mathbb{N} \setminus \{0\}}$ is non-negative, which makes (44) the pointwise limit of a conic combination of pd kernels, and therefore a pd kernel. The proof for $\sum_{k=0}^{\infty} \beta_k\, k_{q,k}^{\mathrm{cond}}$ is analogous. □

Notice that if we set $\beta_0 = \ldots = \beta_{k-1} = 1$ and $\beta_j = 0$, $\forall j \ge k$, in the above proposition, we recover the $k$-th order joint JT $q$-difference.

Finally, notice that, in the same way that the linear kernel is a special case of a JT kernel when 
q = 2 (see Cor. 37), the p-spectrum kernel (39) is a particular case of a p-th order joint JT kernel, 
and the weighted all substrings kernel (40) is a particular case of a combination of joint JT kernels 
in the form (43), both obtained when we set q = 2 and the weights $\omega_1$ and $\omega_2$ equal to the length
of the strings. Therefore, we conclude that the JT string kernels introduced in this section subsume 
these two well-known string kernels. 


7. Experiments 


We illustrate the performance of the proposed nonextensive information theoretic kernels, in
comparison with common kernels, for SVM-based text classification. We performed experiments with
two standard data sets: Reuters-21578³ and WebKB.⁴ Since our objective was to evaluate the
kernels, we considered a simple binary classification task that tries to discriminate between the two
largest categories of each data set; this led us to the earn-vs-acq classification task for the first data
set, and stud-vs-fac (students’ vs. faculty webpages) in the second data set. Two different
frameworks were considered: modeling documents as bags of words, and modeling them as strings of
characters. Therefore, both bags of words kernels and string kernels were employed for each task.





3. Available at www.daviddlewis.com/resources/testcollections. 
4. Available at www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data. 
7.1 Documents as Bags of Words


For the bags of words framework, after the usual preprocessing steps of stemming and stop-word
removal, we mapped text documents into probability distributions over words using the bag-of-words
model and maximum likelihood estimation; this corresponds to normalizing the term frequencies
(tf) using the ℓ1-norm, and is referred to as tf (Joachims, 2002; Baeza-Yates and Ribeiro-Neto,
1999). We also used the tf-idf (term frequency-inverse document frequency) representation, which
penalizes terms that occur in many documents (Joachims, 2002; Baeza-Yates and Ribeiro-Neto,
1999). To weight the documents for the Tsallis kernels, we tried four strategies: uniform weighting,
word counts, square root of the word counts, and one plus the logarithm of the word counts;
however, for both tasks, uniform weighting proved to be the best strategy, which may be due to the fact that
documents in both collections are usually short and do not differ much in size.
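As an illustration of the tf representation and of a Tsallis kernel evaluated on it, the following is a minimal sketch. It assumes the Tsallis entropy $S_q(p) = (1 - \sum_j p_j^q)/(q-1)$, the JT q-difference $S_q(\sum_t \pi_t p_t) - \sum_t \pi_t^q S_q(p_t)$, and the weighted kernel form $(S_q(\pi) - T_q^{\pi})(\omega_1+\omega_2)^q$; the function names and the toy token lists are hypothetical, and this is not the code used in the experiments.

```python
import math
from collections import Counter


def tf_distribution(tokens):
    """Maximum likelihood estimate: term counts normalized to sum to one (l1 norm)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum_j p_j^q) / (q - 1); q = 1 gives the Shannon entropy in nats."""
    probs = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in probs)
    return (1.0 - sum(x ** q for x in probs)) / (q - 1.0)


def jt_kernel(p1, p2, q, w1=1.0, w2=1.0):
    """Weighted JT kernel sketch with pi = (w1, w2) / (w1 + w2); defaults give uniform weighting."""
    vocab = sorted(set(p1) | set(p2))
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    v1 = [p1.get(w, 0.0) for w in vocab]
    v2 = [p2.get(w, 0.0) for w in vocab]
    mix = [pi1 * a + pi2 * b for a, b in zip(v1, v2)]
    # JT q-difference of the two documents under the weights (pi1, pi2)
    t_q = (tsallis_entropy(mix, q)
           - pi1 ** q * tsallis_entropy(v1, q)
           - pi2 ** q * tsallis_entropy(v2, q))
    return (tsallis_entropy([pi1, pi2], q) - t_q) * (w1 + w2) ** q


doc1 = tf_distribution("a b b c".split())
doc2 = tf_distribution("b c c d".split())
value = jt_kernel(doc1, doc2, q=2.0)
```

Under these assumptions, setting q = 2 makes the value proportional to the inner product of the two weighted measures, which is consistent with the linear kernel being a special case.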

As baselines, we used the linear kernel with ℓ2 normalization, commonly used for this task
(Joachims, 2002), and the heat kernel approximation introduced by Lafferty and Lebanon (2005):

$$k^{\mathrm{heat}}(p_1,p_2) = (4\pi t)^{-\frac{n}{2}} \exp\left(-\frac{1}{4t}\, d_g^2(p_1,p_2)\right),$$

where $t > 0$ and $d_g(p_1,p_2) = 2\arccos\left(\sum_i \sqrt{p_{1i}\,p_{2i}}\right)$. Although Lafferty and Lebanon (2005) provide
empirical evidence that the heat kernel outperforms the linear kernel, it is not guaranteed to be pd
for an arbitrary choice of t, as we show in Appendix E. This parameter and the SVM C parameter
were tuned by cross-validation over the training set. The SVM-Light package (available at
http://svmlight.joachims.org/) was used to solve the SVM quadratic optimization problem.
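The heat kernel approximation above can be sketched as follows. This is a minimal illustration, assuming that both distributions are given as equal-length lists of probabilities and that n is the dimension of the simplex (number of components minus one); the function names are hypothetical.

```python
import math


def geodesic_distance(p1, p2):
    """d_g(p1, p2) = 2 * arccos(sum_i sqrt(p1_i * p2_i)) for two discrete distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p1, p2))
    # Clamp to [0, 1] so that rounding errors cannot push acos out of its domain.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def heat_kernel(p1, p2, t):
    """Heat kernel approximation on the simplex of dimension n = len(p1) - 1."""
    n = len(p1) - 1
    d = geodesic_distance(p1, p2)
    return (4.0 * math.pi * t) ** (-n / 2.0) * math.exp(-d * d / (4.0 * t))
```

For large vocabularies the factor $(4\pi t)^{-n/2}$ can underflow or overflow; since it does not depend on the arguments, an implementation might drop it or work in the log domain.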

Figures 2-3 summarize the results. We report the performance of the Tsallis kernels as a
function of the entropic index q. For comparison, we also plot the performance of an instance of a Tsallis
kernel with q tuned by cross-validation. For the first task, this kernel and the two baselines exhibit
similar performance for both the tf and the tf-idf representations; differences are not statistically
significant. In the second task, the Tsallis kernel outperformed the ℓ2-normalized linear kernel for
both representations, and the heat kernel for tf-idf; the differences are statistically significant (using
the unpaired t test at the 0.05 level). Regarding the influence of the entropic index, we observe that
in both tasks, the optimal value of q is usually higher for tf-idf than for tf.

The results on these two problems are representative of the typical relative performance of the
kernels considered: in almost all tested cases, both the heat kernel and the Tsallis kernels (for a
suitable value of q) outperform the ℓ2-normalized linear kernel; the Tsallis kernels are competitive
with the heat kernel.


7.2 Documents as Strings 


In the second set of experiments, each document is mapped into a probability distribution over
character p-grams, using maximum likelihood estimation; we did experiments for p = 3,4,5. To
weight the documents for the p-th order joint Jensen-Tsallis kernels, four strategies were attempted:
uniform weighting, document lengths (in characters), square root of the document lengths, and
one plus the logarithm of the document lengths. For the earn-vs-acq task, all strategies performed
similarly, with a slight advantage for the square root and logarithm of the document lengths; for
the stud-vs-fac task, uniform weighting proved to be the best strategy. For simplicity, all experiments
reported here use uniform weighting.
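The mapping from a document to a distribution over character p-grams can be sketched as below. This is a minimal illustration, assuming overlapping p-grams read with a sliding window and no additional preprocessing; `pgram_distribution` is a hypothetical name.

```python
from collections import Counter


def pgram_distribution(text, p):
    """Maximum likelihood distribution over the overlapping character p-grams of `text`."""
    grams = [text[i:i + p] for i in range(len(text) - p + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # An input shorter than p has no p-grams and yields an empty distribution.
    return {g: c / total for g, c in counts.items()}
```

The resulting dictionaries can be passed to any kernel defined on discrete probability distributions.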


Figure 2: Results for earn-vs-acq using tf and tf-idf representations. The error bars represent ±1
standard deviation on 30 runs. Training (resp. testing) with 200 (resp. 250) samples per
class.

[Plot omitted. Two panels (tf and tf-idf); horizontal axis: entropic index q, from 0 to 2; vertical axis:
average error rate (%); legend: WJTK-q, LinL2, Heat and WJTK, the last two with parameters tuned by
cross-validation.]

Figure 3: Results for stud-vs-fac using tf and tf-idf representations. The error bars represent ±1
standard deviation on 30 runs. Training (resp. testing) with 200 (resp. 250) samples per
class.





As baselines, we used the p-spectrum kernel (PSK, see 39) for the values of p referred above,
and the weighted all substrings kernel (WASK, see 40) with decaying factor tuned to $\lambda = 0.75$
(which yielded the best results), with $p_{\min} = p$ set to the values above, and $p_{\max} = \infty$. The SVM C
parameter was tuned by cross-validation over the training set.

Figures 4-5 summarize the results. 

For the first task, the JT string kernel and the WASK outperformed the PSK (with statistical
significance for p = 3), all kernels performed similarly for p = 4, and the JT string kernel
outperformed the WASK for p = 5; all other differences are not statistically significant. In the second task,
the JT string kernel outperformed both the WASK and the PSK (and the WASK outperformed the
PSK), with statistical significance for p = 3,4,5. Furthermore, by comparing Figures 3 and 5, we
also observe that the 5-th order JT string kernel remarkably outperforms all bags of words kernels
for the stud-vs-fac task, even though it does not use or build any sort of language model at the word
level.

Figure 4: Results for earn-vs-acq using string kernels and p = 3,4,5. The error bars represent ±1
standard deviation on 15 runs. Training (resp. testing) with 200 (resp. 250) samples per
class.

Figure 5: Results for stud-vs-fac using string kernels and p = 3,4,5. The error bars represent ±1
standard deviation on 15 runs. Training (resp. testing) with 200 (resp. 250) samples per
class.


8. Conclusions


In this paper we have introduced a new family of positive definite kernels between measures, which
includes previous information-theoretic kernels on probability measures as particular cases. One of
the key features of the new kernels is that they are defined on unnormalized measures (not
necessarily normalized probabilities). This is relevant, for example, for kernels on empirical measures (such
as word counts, pixel intensity histograms); instead of the usual step of normalization (Hein et al.,
2004), we may leave these empirical measures unnormalized, thus allowing objects of different
sizes (e.g., documents of different lengths, images with different sizes) to be weighted differently.
Another possibility is the explicit inclusion of weights: given two normalized measures, they can
be multiplied by arbitrary (positive) weights before being fed to the kernel function. In addition,
we define positive definite kernels between stochastic processes that subsume well-known string
kernels.


The new kernels and the proofs of positive definiteness rely on other main contributions of this
paper: the new concept of q-convexity, for which we proved a Jensen q-inequality; the concept
of Jensen-Tsallis q-difference, a nonextensive generalization of the Jensen-Shannon divergence;
and denormalization formulae for several entropies and divergences.


We have reported experiments in which these new kernels were used in support vector machines 
for text classification tasks. Although the reported experiments do not lead to strong conclusions, 
they show that the new kernels are competitive with the state-of-the-art, in some cases yielding a 
significant performance improvement. 


Future research will concern applying Jensen-Tsallis q-differences to other learning problems,
like clustering, possibly exploiting the fact that they accept more than two arguments.
Appendix A. Suyari’s Axioms for Nonextensive Entropies


Suyari (2004) proposed the following set of axioms (referred to above as Suyari’s axioms) that
determine nonextensive entropies of the form stated in (1). Below, $q \geq 0$ is any fixed scalar and $f_q$ is a
function defined on $\bigcup_{n=1}^{\infty} \Delta^{n-1}$.


(A1) Continuity: $f_q|_{\Delta^{n-1}}$ is continuous, for any $n \in \mathbb{N}$;

(A2) Maximality: For any $n \in \mathbb{N}$ and $(p_1,\dots,p_n) \in \Delta^{n-1}$,

$$f_q(p_1,\dots,p_n) \leq f_q(1/n,\dots,1/n);$$

(A3) Generalized additivity: For $i = 1,\dots,n$, $j = 1,\dots,m_i$, $p_{ij} \geq 0$, and $p_i = \sum_{j=1}^{m_i} p_{ij}$,

$$f_q(p_{11},\dots,p_{nm_n}) = f_q(p_1,\dots,p_n) + \sum_{i=1}^{n} p_i^q\, f_q\!\left(\frac{p_{i1}}{p_i},\dots,\frac{p_{im_i}}{p_i}\right);$$

(A4) Expandability: $f_q(p_1,\dots,p_n,0) = f_q(p_1,\dots,p_n)$.
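Axiom (A3) can be checked numerically for a concrete candidate. The sketch below assumes that $f_q$ is the Tsallis entropy $S_q(p) = (1 - \sum_j p_j^q)/(q-1)$, with the Shannon entropy at q = 1; the function names and the example masses are hypothetical.

```python
import math


def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum_j p_j^q) / (q - 1); q = 1 gives the Shannon entropy in nats."""
    probs = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in probs)
    return (1.0 - sum(x ** q for x in probs)) / (q - 1.0)


def additivity_gap(groups, q):
    """Left side minus right side of (A3), where groups[i][j] holds the mass p_ij."""
    joint = [x for g in groups for x in g]
    marginals = [sum(g) for g in groups]
    lhs = tsallis_entropy(joint, q)
    rhs = tsallis_entropy(marginals, q) + sum(
        (m ** q) * tsallis_entropy([x / m for x in g], q)
        for g, m in zip(groups, marginals)
        if m > 0  # groups with zero mass contribute nothing
    )
    return lhs - rhs


# Illustrative joint masses p_ij, grouped by i; they sum to one.
example_groups = [[0.1, 0.2], [0.3, 0.15, 0.25]]
```

Under this assumption the gap is zero up to rounding for every q, since the factor $p_i^q$ exactly compensates the renormalization of each group.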


Appendix B. Proof of Proposition 8 


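Proof The case q = 1 corresponds to the Jensen difference and was proved by Burbea and Rao
(1982) (Theorem 1). Our proof extends that to $q \neq 1$. Let $y = (y_1,\dots,y_m)$, where $y_t = (y_{t1},\dots,y_{tn})$.
Thus

$$T^{\pi}_{q,\Psi}(y) = \Psi\!\left(\sum_{t=1}^{m} \pi_t y_t\right) - \sum_{t=1}^{m} \pi_t^q\, \Psi(y_t)
= \sum_{i=1}^{n} \left[\sum_{t=1}^{m} \pi_t^q\, \varphi(y_{ti}) - \varphi\!\left(\sum_{t=1}^{m} \pi_t y_{ti}\right)\right],$$

showing that it suffices to consider n = 1, where each $y_t \in [0,1]$, that is,

$$T^{\pi}_{q,\Psi}(y_1,\dots,y_m) = \sum_{t=1}^{m} \pi_t^q\, \varphi(y_t) - \varphi\!\left(\sum_{t=1}^{m} \pi_t y_t\right);$$

this function is convex on $[0,1]^m$ if and only if, for every fixed $a_1,\dots,a_m \in [0,1]$, and $b_1,\dots,b_m \in \mathbb{R}$,
the function

$$f(x) = T^{\pi}_{q,\Psi}(a_1 + b_1 x,\dots,a_m + b_m x)$$

is convex in $\{x \in \mathbb{R} : a_t + b_t x \in [0,1],\ t = 1,\dots,m\}$. Since f is $C^2$, it is convex if and only if $f''(x) \geq 0$.
We first show that convexity of f (equivalently of $T^{\pi}_{q,\Psi}$) implies convexity of $\varphi$. Letting $c_t = a_t + b_t x$,

$$f''(x) = \sum_{t=1}^{m} \pi_t^q\, b_t^2\, \varphi''(c_t) - \left(\sum_{t=1}^{m} \pi_t b_t\right)^{2} \varphi''\!\left(\sum_{t=1}^{m} \pi_t c_t\right). \tag{45}$$

By choosing x = 0, $a_t = a \in [0,1]$, for $t = 1,\dots,m$, and $b_1,\dots,b_m$ satisfying $\sum_t \pi_t b_t = 0$ in (45), we
get

$$f''(0) = \varphi''(a) \sum_{t=1}^{m} \pi_t^q\, b_t^2;$$

hence, if f is convex, $\varphi''(a) \geq 0$ thus $\varphi$ is convex.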
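Next, we show that convexity of f also implies (2 − q)-convexity of $-1/\varphi''$. By choosing x = 0
(thus $c_t = a_t$) and $b_t = \pi_t^{1-q} \left(\varphi''(a_t)\right)^{-1}$, we get

$$f''(0) = \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)} - \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)}\right)^{2} \varphi''\!\left(\sum_{t=1}^{m} \pi_t a_t\right)$$

$$\phantom{f''(0)} = \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)}\right) \varphi''\!\left(\sum_{t=1}^{m} \pi_t a_t\right) \left[\frac{1}{\varphi''\!\left(\sum_{t=1}^{m} \pi_t a_t\right)} - \sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(a_t)}\right],$$

where the expression inside the square brackets is the Jensen (2 − q)-difference of $1/\varphi''$ (see
Definition 7). Since $\varphi''(x) \geq 0$, the factor outside the square brackets is non-negative, thus the Jensen
(2 − q)-difference of $1/\varphi''$ is also nonnegative and $-1/\varphi''$ is (2 − q)-convex.

Finally, we show that if $\varphi$ is convex and $-1/\varphi''$ is (2 − q)-convex, then $f'' \geq 0$, thus $T^{\pi}_{q,\Psi}$ is
convex. Let $r_t = \left(q\,\pi_t^{2-q}/\varphi''(c_t)\right)^{1/2}$ and $s_t = b_t \left(\pi_t^q\, \varphi''(c_t)/q\right)^{1/2}$; then, non-negativity of $f''$ results
from the following chain of inequalities/equalities:

$$0 \leq \left(\sum_{t=1}^{m} r_t^2\right) \left(\sum_{t=1}^{m} s_t^2\right) - \left(\sum_{t=1}^{m} r_t s_t\right)^{2} \tag{46}$$

$$\phantom{0} = \left(\sum_{t=1}^{m} \frac{\pi_t^{2-q}}{\varphi''(c_t)}\right) \left(\sum_{t=1}^{m} b_t^2\, \pi_t^q\, \varphi''(c_t)\right) - \left(\sum_{t=1}^{m} b_t \pi_t\right)^{2} \tag{47}$$

$$\phantom{0} \leq \frac{1}{\varphi''\!\left(\sum_{t=1}^{m} \pi_t c_t\right)} \sum_{t=1}^{m} b_t^2\, \pi_t^q\, \varphi''(c_t) - \left(\sum_{t=1}^{m} b_t \pi_t\right)^{2} \tag{48}$$

$$\phantom{0} = \frac{1}{\varphi''\!\left(\sum_{t=1}^{m} \pi_t c_t\right)} \cdot f''(x), \tag{49}$$

where: (46) is the Cauchy-Schwarz inequality; Equality (47) results from the definitions of $r_t$ and
$s_t$ and from the fact that $r_t s_t = b_t \pi_t$; Inequality (48) states the (2 − q)-convexity of $-1/\varphi''$; Equality
(49) results from (45). ∎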
Appendix C. Proof of Proposition 10
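Proof The proof of (29), for $q \geq 0$, results from

$$T^{\pi}_q(p_1,\dots,p_m) = S_q\!\left(\sum_t \pi_t p_t\right) - \sum_t \pi_t^q\, S_q(p_t)$$

$$\phantom{T^{\pi}_q} = \frac{1}{q-1} \left[1 - \sum_j \left(\sum_t \pi_t p_{tj}\right)^{q} - \sum_t \pi_t^q \left(1 - \sum_j p_{tj}^q\right)\right]$$

$$\phantom{T^{\pi}_q} = S_q(\pi) + \frac{1}{q-1} \left[\sum_j \sum_t \left(\pi_t p_{tj}\right)^{q} - \sum_j \left(\sum_t \pi_t p_{tj}\right)^{q}\right]$$

$$\phantom{T^{\pi}_q} \leq S_q(\pi),$$

where the inequality holds since, for $y_i \geq 0$: if $q \geq 1$, then $\sum_i y_i^q \leq \left(\sum_i y_i\right)^q$; if $q \in [0,1]$, then
$\sum_i y_i^q \geq \left(\sum_i y_i\right)^q$.

The proof that $T^{\pi}_q \geq 0$ for $q \geq 1$ uses the notion of q-convexity. Since $\mathcal{X}$ is countable, the Tsallis
entropy is as in (2), thus $S_q \geq 0$. Since $-S_q$ is 1-convex, then, by Proposition 6, it is also q-convex
for $q \geq 1$. Consequently, from the q-Jensen inequality (Proposition 5), for finite $\mathcal{T}$, with $|\mathcal{T}| = m$,

$$T^{\pi}_q(p_1,\dots,p_m) = S_q\!\left(\sum_{t=1}^{m} \pi_t p_t\right) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t) \geq 0.$$

Since $S_q$ is continuous, so is $T^{\pi}_q$, thus the inequality is valid in the limit as $m \to \infty$, which proves the
assertion for $\mathcal{T}$ countable. Finally, $T^{\pi}_q(\delta_1,\dots,\delta_1,\dots) = 0$, where $\delta_1$ is some degenerate distribution.

Finally, to prove (30), for $q \in [0,1]$ and $\mathcal{X}$ finite,

$$T^{\pi}_q(p_1,\dots,p_m) = S_q\!\left(\sum_{t=1}^{m} \pi_t p_t\right) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t)$$

$$\phantom{T^{\pi}_q} \geq \sum_{t=1}^{m} \pi_t\, S_q(p_t) - \sum_{t=1}^{m} \pi_t^q\, S_q(p_t) \tag{50}$$

$$\phantom{T^{\pi}_q} = \sum_{t=1}^{m} \left(\pi_t - \pi_t^q\right) S_q(p_t)$$

$$\phantom{T^{\pi}_q} \geq S_q(U) \sum_{t=1}^{m} \left(\pi_t - \pi_t^q\right) \tag{51}$$

$$\phantom{T^{\pi}_q} = S_q(\pi) \left[1 - n^{1-q}\right],$$

where Inequality (50) results from $S_q$ being concave, and Inequality (51) holds since $\pi_t - \pi_t^q \leq 0$,
for $q \in [0,1]$, and the uniform distribution U maximizes $S_q$, with $S_q(U) = (1 - n^{1-q})/(q-1)$.
∎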
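The bounds proved above lend themselves to a quick numerical sanity check. The sketch below assumes finite distributions with n outcomes, the Tsallis entropy in the form $S_q(p) = (1 - \sum_j p_j^q)/(q-1)$ for $q \neq 1$, and randomly drawn inputs; all function names are hypothetical.

```python
import random


def tsallis_entropy(p, q):
    """S_q(p) = (1 - sum_j p_j^q) / (q - 1), for q different from 1."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


def jt_q_difference(dists, weights, q):
    """T_q^pi(p_1, ..., p_m) = S_q(sum_t pi_t p_t) - sum_t pi_t^q S_q(p_t)."""
    n = len(dists[0])
    mix = [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(n)]
    return tsallis_entropy(mix, q) - sum(
        (w ** q) * tsallis_entropy(d, q) for w, d in zip(weights, dists)
    )


def random_distribution(n, rng):
    """Draw a strictly positive probability vector with n entries."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


def bounds(dists, weights, q):
    """Return (lower, value, upper): lower is 0 for q > 1 and S_q(pi)(1 - n^(1-q)) for q < 1."""
    n = len(dists[0])
    upper = tsallis_entropy(weights, q)
    lower = upper * (1.0 - n ** (1.0 - q)) if q < 1.0 else 0.0
    return lower, jt_q_difference(dists, weights, q), upper
```

A caller can draw many random inputs with `random_distribution` and confirm that the middle value returned by `bounds` always lies between the other two, up to rounding.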


Appendix D. Proof of Proposition 41 


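Proof We show a counterexample with q = 1 (the extensive case), $\pi = (1/2,1/2)$ and k = 1,
that discards both cases. It suffices to show that $\sqrt{JS^{\mathrm{cond}}_{1}}$ violates the triangle
inequality for some choice of stochastic processes $s_1, s_2, s_3$, and therefore that $JS^{\mathrm{cond}}_{1}$ is not a squared distance;
this in turn implies that $JS^{\mathrm{cond}}_{1}$ is not nd and, from Proposition 18, that the above two kernels
are not pd. We define $s_1, s_2, s_3$ to be stationary first order Markov processes in a binary alphabet
$\mathcal{A} = \{0,1\}$ defined by the following transition matrices, respectively:

$$S_1 = \lim_{\varepsilon \to 0} \begin{bmatrix} 1-\varepsilon & \varepsilon \\ 1/4 & 3/4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1/4 & 3/4 \end{bmatrix},$$

$$S_2 = \lim_{\varepsilon \to 0} \begin{bmatrix} 3/4 & 1/4 \\ \varepsilon & 1-\varepsilon \end{bmatrix} = \begin{bmatrix} 3/4 & 1/4 \\ 0 & 1 \end{bmatrix},$$

$$S_3 = \lim_{\varepsilon \to 0} \begin{bmatrix} \varepsilon & 1-\varepsilon \\ 1/4 & 3/4 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1/4 & 3/4 \end{bmatrix},$$

whose stationary distributions are

$$\sigma_1 = \lim_{\varepsilon \to 0} \frac{1}{1+4\varepsilon} \begin{bmatrix} 1 \\ 4\varepsilon \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad
\sigma_2 = \lim_{\varepsilon \to 0} \frac{1}{1+4\varepsilon} \begin{bmatrix} 4\varepsilon \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad
\sigma_3 = \lim_{\varepsilon \to 0} \frac{1}{5-4\varepsilon} \begin{bmatrix} 1 \\ 4-4\varepsilon \end{bmatrix} = \begin{bmatrix} 1/5 \\ 4/5 \end{bmatrix}.$$

The matrix of first order conditional JT 1-differences (or first order conditional Jensen-Shannon
divergences) is

$$\begin{bmatrix} 0 & 0 & \frac{3}{5} H\!\left(\frac{1}{6}\right) \\ * & 0 & \frac{9}{10} H\!\left(\frac{1}{9}\right) - \frac{2}{5} H\!\left(\frac{1}{4}\right) \\ * & * & 0 \end{bmatrix}
= \begin{bmatrix} 0 & 0 & 0.390 \\ * & 0 & 0.128 \\ * & * & 0 \end{bmatrix}, \tag{52}$$

which fails to be negative definite, since

$$\sqrt{JS^{\mathrm{cond}}_{1}(s_1,s_2)} + \sqrt{JS^{\mathrm{cond}}_{1}(s_2,s_3)} < \sqrt{JS^{\mathrm{cond}}_{1}(s_1,s_3)},$$

which violates the triangle inequality required for $\sqrt{JS^{\mathrm{cond}}_{1}}$ to be a metric.

Interestingly, the 0-th order conditional Jensen-Shannon divergence matrix (this one ensured to
be negative definite because it equals a standard Jensen-Shannon divergence matrix) is

$$\begin{bmatrix} 0 & 1 & H\!\left(\frac{2}{5}\right) - \frac{1}{2} H\!\left(\frac{1}{5}\right) \\ * & 0 & H\!\left(\frac{1}{10}\right) - \frac{1}{2} H\!\left(\frac{1}{5}\right) \\ * & * & 0 \end{bmatrix}
= \begin{bmatrix} 0 & 1 & 0.610 \\ * & 0 & 0.108 \\ * & * & 0 \end{bmatrix}. \tag{53}$$

From the chain rule (35), we have that the sum of the matrices (52) and (53) is the second order
joint Jensen-Shannon divergence, and therefore is also guaranteed to be negative definite. ∎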
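The numerical entries of the counterexample can be reproduced with a few lines of code. The sketch below works directly with the limiting transition matrices and stationary distributions (ε = 0), and assumes entropies measured in bits, the weights (1/2, 1/2), and the conditional divergence taken as the conditional entropy of the mixture process minus the average conditional entropy of the two processes; the function names are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


def cond_entropy(sigma, trans):
    """H(X2 | X1) for a chain with stationary law `sigma` and transition matrix `trans`."""
    return sum(s * entropy_bits(row) for s, row in zip(sigma, trans))


def js_cond(proc_a, proc_b):
    """First order conditional Jensen-Shannon divergence with weights (1/2, 1/2)."""
    (sa, ta), (sb, tb) = proc_a, proc_b
    # Joint law of (X1, X2) under the equal-weight mixture of the two processes.
    joint = [[0.5 * sa[i] * ta[i][j] + 0.5 * sb[i] * tb[i][j] for j in range(2)]
             for i in range(2)]
    marg = [sum(row) for row in joint]
    mix = sum(m * entropy_bits([x / m for x in row])
              for m, row in zip(marg, joint) if m > 0)
    return mix - 0.5 * cond_entropy(sa, ta) - 0.5 * cond_entropy(sb, tb)


# Each process is a pair (stationary distribution, transition matrix), taken at the limit.
s1 = ([1.0, 0.0], [[1.0, 0.0], [0.25, 0.75]])
s2 = ([0.0, 1.0], [[0.75, 0.25], [0.0, 1.0]])
s3 = ([0.2, 0.8], [[0.0, 1.0], [0.25, 0.75]])


def violates_triangle(a, b, c):
    """True if sqrt(d(a,b)) + sqrt(d(b,c)) < sqrt(d(a,c)) for d = js_cond."""
    root = lambda x, y: math.sqrt(max(0.0, js_cond(x, y)))
    return root(a, b) + root(b, c) < root(a, c)
```

Calling `violates_triangle(s1, s2, s3)` evaluates the inequality stated in the proof for these three processes.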
Appendix E. The Heat Kernel Approximation


The diffusion kernel for statistical manifolds, recently proposed by Lafferty and Lebanon (2005), is
grounded in information geometry (Amari and Nagaoka, 2001). It models the diffusion of
“information” over a statistical manifold according to the heat equation. Since in the case of the multinomial
manifold (the relative interior of $\Delta^n$), the diffusion kernel has no closed form, the authors adopt
the so-called “first-order parametrix expansion,” which resembles the Gaussian kernel replacing the
Euclidean distance by the geodesic distance that is induced when the manifold is endowed with a
Riemannian structure given by the Fisher information (we refer to Lafferty and Lebanon 2005 for
further details). The resulting heat kernel approximation is


$$k^{\mathrm{heat}}(p_1,p_2) = (4\pi t)^{-\frac{n}{2}} \exp\left(-\frac{1}{4t}\, d_g^2(p_1,p_2)\right),$$

where $t > 0$ and $d_g(p_1,p_2) = 2\arccos\left(\sum_i \sqrt{p_{1i}\,p_{2i}}\right)$. Whether $k^{\mathrm{heat}}$ is pd has been an open problem
(Hein et al., 2004; Zhang et al., 2005). Let $S^n_+$ be the positive orthant of the n-dimensional sphere,
that is,

$$S^n_+ = \left\{ (x_1,\dots,x_{n+1}) \in \mathbb{R}^{n+1} \;\middle|\; \sum_{i=1}^{n+1} x_i^2 = 1,\ \forall i\ x_i \geq 0 \right\}.$$

The problem can be restated as follows: is there an isometric embedding from $S^n_+$ to some Hilbert
space? In this section we answer that question in the negative.


Proposition 43 Let $n \geq 2$. For sufficiently large t, the kernel $k^{\mathrm{heat}}$ is not pd.

Proof From Proposition 19, $k^{\mathrm{heat}}$ is pd, for all $t > 0$, if and only if $d_g^2$ is nd. We provide a
counterexample, using the following four points in $\Delta^2$: $p_1 = (1,0,0)$, $p_2 = (0,1,0)$, $p_3 = (0,0,1)$ and
$p_4 = (1/2,1/2,0)$. The squared distance matrix $[D_{ij}] = [d_g^2(p_i,p_j)]$ is

$$D = \frac{\pi^2}{4} \begin{bmatrix} 0 & 4 & 4 & 1 \\ 4 & 0 & 4 & 1 \\ 4 & 4 & 0 & 4 \\ 1 & 1 & 4 & 0 \end{bmatrix}.$$

Taking $c = (-4,-4,1,7)$ we have $c^{T} D c = 2\pi^2 > 0$, showing that D is not nd. Although $p_1, p_2, p_3, p_4$
lie on the boundary of $\Delta^2$, continuity of $d_g^2$ implies that it is not nd on the relative interior of $\Delta^2$. The
case $n > 2$ follows easily, by appending zeros to the four vectors above. ∎
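The counterexample can be verified numerically. The sketch below rebuilds the squared geodesic distance matrix for the four points, evaluates the quadratic form for the vector c, and also evaluates the same quadratic form on the matrix of values $\exp(-D_{ij}/(4t))$, omitting the positive factor $(4\pi t)^{-n/2}$ since it does not affect the sign; the function names and the value of t used in the checks are illustrative choices.

```python
import math


def squared_geodesic(p1, p2):
    """d_g^2(p1, p2), with d_g = 2 * arccos(sum_i sqrt(p1_i * p2_i))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p1, p2))
    return (2.0 * math.acos(min(1.0, max(0.0, bc)))) ** 2


points = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0)]
c = (-4, -4, 1, 7)  # the entries sum to zero, as required to test negative definiteness
D = [[squared_geodesic(a, b) for b in points] for a in points]


def quadratic_form(matrix, vec):
    """Return vec^T * matrix * vec."""
    size = len(vec)
    return sum(vec[i] * matrix[i][j] * vec[j] for i in range(size) for j in range(size))


def heat_quadratic_form(t):
    """c^T K c for K_ij = exp(-D_ij / (4 t)); a negative value shows K is not pd."""
    gram = [[math.exp(-d / (4.0 * t)) for d in row] for row in D]
    return quadratic_form(gram, c)
```

A negative value of `heat_quadratic_form(t)` for some t exhibits a Gram matrix that is not positive semidefinite, matching the statement of Proposition 43.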